Beware the data science pin factory: The power of the data science generalist (stitchfix.com)
195 points by ericcolson 15 days ago | 69 comments

I really wish hiring managers read this. I am a data generalist, and have had no traction with obtaining even an interview for a data science job. I’ve set up a private JupyterHub where I run Python ETL, interactive models, and dashboards. I deployed Metabase several times and have written hundreds of SQL queries. I’ve used Tableau with gigantic datasets. I built a front end serverless analytics pipeline from scratch with AWS that handles 30M events/mo. I've demonstrably grown revenue and margins in multiple contexts with my data products. I’m working on making a fully dynamic frontend for content recommendations. I have self-taught all of these skills in the past 3 years after a decade in sales, marketing, and entrepreneurship. What I haven’t done: a CS/math degree (mine was music), graduate work, or tech work at a household name. Lived in the Bay Area. Gotten an interview for any data job. Sigh.

None of what you describe is something I'd hire into a data science role specifically. Some of those skills are skills I'd expect a data scientist to have (e.g. SQL skills). A data scientist in this context has to have an understanding of basic statistics generally: hypothesis testing, modelling techniques and their applications, and performance tuning and evaluation. They also need to understand how to devise and run experiments that collect and make use of data in practice (i.e. "real world" data). I wouldn't necessarily expect a candidate to be able to derive the formulas involved, but it would be the odd candidate who truly grasped the nuances who could not do so.

The one thing I see kind of missing is a math background or at least a project proving that that is in your skillset (recommendations sounds like it could fit this). There are a lot of people with a similar background to you and normally those are in "business intelligence/analytics" or "data engineering" where they are mostly writing sql queries and interacting with dashboards/OLAP cubes or setting up those dashboards/cubes.

That's perfectly fine but it's not what traditionally is referred to as data science. I'm actually quite annoyed at what has been happening to the term data science lately - it's supposed to be some stats-heavy/applied-AI role but a lot of companies hiring "data scientists" are really just hiring SQL jockeys.

Personally I've done both data science and data infrastructure and I like infrastructure a lot more anyway. And it sounds like you are somewhat qualified for that with some of your pipeline work (although big data experience is also important). A LOT of data science departments have no idea what type of business value they are supposed to be adding, are doing shitty boring work with glorified titles, or are improperly integrated with the company at large (bad productionizing processes, poor data infrastructure). There's always going to be a need for data infrastructure but the "data science" hype is going to fade once all the shitty data departments cut the fat.

Have you tried paid services/consulting arms of software or cloud companies? Teams that bill customers at an hourly rate? They generally look for generalists who can help customers tackle problems at different levels of the stack. They aren't looking for PhDs in statistics.

When I interview people who have your type of background, I tend to get confused by what exactly it is the person wants to do (Analyze Data? Build an Analytics Pipeline/Architecture? Write Software/Services? Be an Analytics IT person?). Sometimes they have to talk about all the "cool" stuff they've done and lose focus on what they bring to the role. I also become skeptical because it's really easy nowadays to follow a few tutorials on 50 different things and then boast about how you did it all yourself.

Even reading your comment, you don't sound like somebody who wants to analyze data.

As far as being a generalist, I definitely agree that it's good to have somebody with skills in ETL, analyzing data, and maybe building a software service. But what happens is that all those things happen at different speeds and then people get crushed. You're asked to investigate some data quickly over 3 days, but suddenly the software service you built is having issues and you need 2 weeks to dig in and fix it, and also your ETL job is overloading the server and you need to fix it yesterday, but you need help from somebody else to figure it out. Oh, and that Metabase thing you installed is broken and the VP was using it and has a big demo tomorrow.

> I tend to get confused by what exactly it is the person wants to do (Analyze Data? Build an Analytics Pipeline/Architecture? Write Software/Services?

But this is the crux of the job-seeker's dilemma. If he/she is specific about their interests when speaking to an interviewer, they might get a response like "well, we're really looking for someone whose operational focus is [something else]".

And if they're not super-specific (I doubt anyone does data analysis exclusively without any other involvement in the project), but instead attempt to give examples where they had demonstrable impact working across a number of domains, you might hear a response like this:

> Even reading your comment, you don't sound like somebody who wants to analyze data.

I don't think a person has to be super specific or say "I am interested in X and Y." What they do need is consistency in a resume so the reviewer/interviewer can evaluate what their primary and secondary focus areas are. Sometimes you have to leave stuff out. For example, Data Scientists typically don't deploy and maintain a Data Science stack unless it's a really small company. And a Data Science Infrastructure person at a bigger company probably isn't analyzing data unless they are just playing around to validate their stack.

But if you have a resume (or say this during an interview) that gives equal weight to the data analysis and the stack deployment, it's just confusing to the person reading it. Especially in Data Science, which is already confusing from a skillset perspective. Lots of resumes look like the applicants just thought 10 things with minimal overlap were cool and decided to put them on their resume.

Even if you did work at a 5 person startup and had the unofficial title of "Data Scientist, Data Engineer, Data DevOps, DB Admin, and Chief Data Officer" I'd recommend you downplay some of those based on the jobs you are applying for. Figure out what is essential and what is a +1.

>what exactly it is the person wants to do (Analyze Data? Build an Analytics Pipeline/Architecture? Write Software/Services? Be an Analytics IT person?).

But you realize that the vast majority of businesses in North America need someone to solve all of those problems. They aren't going to hire and can't afford an experienced data team of specialists.

This is the point of the article. Getting the 80% is far more valuable than having some PhD optimizing the hell out of features. Silicon Valley tends to overthink things.

>But you realize that the vast majority of businesses in North America need someone to solve all of those problems. They aren't going to hire and can't afford an experienced data team of specialists.

The vast majority of businesses don't need Data Scientists. They need a BI person with SQL skills and some of the skills of a Database Admin. What most companies really need is a good set of Dashboards and clean data to feed them. This enables the business people to get the information/visibility they need and make decisions.

Also, most businesses should not be building analytic services and deploying them - they should be paying for a good product with a cloud or easy on-prem install and getting support from the company that sells the product. A few licenses of a good BI product are a lot cheaper than a Data Scientist.

I don't see this as a bad thing - it's a lifecycle thing. I absolutely would want someone like the GP to start my data team from day one - there is a lot to build and much to hang together.

When I am up and running I don't want yet another generalist - or rather I will happily take one, I just will put them in a box making pins.

Perhaps the GP will do better at the consulting level - or even some level of productised consulting - an out-of-the-box product.

I'm a hiring manager who shares this view. I see three distinct but overlapping skillsets with creating machine learning - as they mention in the article: data engineering, data science, ML engineering. I would never hire a person who only had one of those skillsets. I prefer all three, but can settle for two depending on the situation.

The trouble you might be having in getting an interview is probably partly to do with your background and likely also that those job postings get A LOT of submissions. Other hiring managers in my department as well as myself have found we get 10x more submissions for DS/ML positions than software dev positions. In general it's a really unrefined and new job category, so anyone and everyone who's taken a Coursera course in regression or clustering will apply.

> I built a front end serverless analytics pipeline from scratch with AWS that handles 30M events/mo. I've demonstrably grown revenue and margins in multiple contexts with my data products

When I'm hiring I care most about wins. These are wins. When I read 30M events / month I want to hear more.

The rest is fluff IMO and things that I'd expect you to play around with while you're self learning. Also, most hiring managers don't care about what you did 10 years ago in sales if you're applying for a data science role. It might be icing on the cake you can share if you get into a conversation about sales or marketing, but otherwise it can feel off topic.

I'd slim down your resume to focus on these two wins (plus any recent experience building or leading teams) and stay laser focused on recent data science related work in production. That sounds like a good enough resume to get an interview at most companies. Good luck!

>>When I'm hiring I care most about wins.

Losses can be just as important, IME. “I tried to do X, tried several methods, each one failed due to...” would be an interesting conversation to have with an interviewee.

Sounds like full stack data visualisation, rather than data science. You’re an engineer applying for science jobs.

Check out the series by Jeff Leek, Brian Caffo and Roger D. Peng, “A Crash Course in Data Science.”

Hope this helps, I can be pretty clueless sometimes so you probably already know all the mathy bits.

I think you've got it flipped. A CS/math degree, and/or STEM graduate work, are very strong indicators of generalist skills and exposure to breadth. And graduate work is a strong signal about someone's ability to learn and deal with exploratory/unknown problems. Whereas the things you've listed are actually more specialized.

This certainly doesn't excuse hiring managers from lazily filtering out candidates who don't have these things. But these are strong signals.

Depends where you're applying, right? If you don't know what hyper-parameters, regularization, or cross-validation are and how they affect your work, then some jobs just won't make sense until you're able to talk about those coherently.

The skills you have are useful. I know we wouldn't hire you in for a data science role on my team. I could imagine many other places they're useful, though, so perhaps it's the places you're searching.
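
For anyone unfamiliar with the three terms above, here is a minimal sketch in plain Python. It is not from the article or from any particular library. It assumes a toy one-feature ridge regression; the data, the fold count `k`, the candidate penalties, and the helper names (`fit_ridge_1d`, `mse`, `cross_val_error`) are all made up for illustration. The L2 penalty `lam` is the regularization, it is also the hyper-parameter, and cross-validation is how it gets chosen.

```python
# Minimal sketch: k-fold cross-validation to pick a regularization strength.
# The data, fold count, and candidate values are all made up for illustration.

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (no intercept); lam is the L2 penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_error(xs, ys, lam, k=3):
    """Average held-out error over k folds (fold i holds out every k-th point)."""
    errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        errors.append(mse([x for x, _ in test], [y for _, y in test], w))
    return sum(errors) / k

# Toy data: roughly y = 2x with a little hand-written noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# lam is the hyper-parameter: chosen by held-out error, not learned by the fit.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cross_val_error(xs, ys, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
```

On this toy data the largest penalty shrinks the slope well below 2 and the held-out error grows, which is the kind of trade-off those interview questions are probing.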

The skills you described are those of a data scientist/engineer hybrid. Have you tried clearly branding yourself as a data engineer? There are a lot fewer of those, and to a hiring manager or non-tech person the two jobs overlap heavily.

Please post some contact information. No idea what your geography is but I am very interested in chatting with you about some roles that I have that may be very well aligned to the skills and interests you are describing.

Are you assuming people with technical degrees don't have these same skills?

You can also train entry-level hires for a good amount less.

You should put some contact info in your bio

If you're interested in relocating to Amsterdam, send an email to earl at apolloagriculture dot com.

Have you tried data engineering jobs rather than data scientist?

Can you create a public demo site to showcase your multi-disciplinary skill set? Perhaps make it specific to sales/marketing, where you have domain knowledge.

How did you teach yourself all that?

You sound like you'd be a great data engineer.

It's not you, it's capitalism. There just aren't enough (well paying, decent) jobs to go around. I'm in the same boat. Welcome aboard matey.

We’re hiring.

Email me: mark at dotscience dot com


Personal attacks are not ok here, and we've banned this account.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future.


This would be unnecessarily harsh even if it were true.

It’s not, though. I know a data scientist who studied Arabic language. And software engineers who studied music.


There are plenty of studious music majors, plenty of liberal arts majors capable of catching up with their C.S. major peers with a year or so of dedicated study, and plenty of people who party hard and are still qualified and capable to do the work that you do.

Your comment reads more like sour grapes about your own college experience than anything else ;)

Send me your CV. If you're legit and resourceful enough to get in touch with me, I'll get you a job or tell you how to get one at least.

I’m not sure this is entirely true. The author is arguing for full stack scientists, and I prefer those people, but they’re hard to find, and even then you don’t necessarily want them doing everything. Worse yet, if you put someone in a full stack position, and they’re not already full stack, you need to budget a lot of mentoring, because if you don’t, you’re going to get a big pile of unmaintainable code.

The author kind of builds a strawman of super specialized data scientists that constantly throw code over the wall to someone else. That doesn’t work, and you simply can’t do that unless your headcount is in the thousands. You have to have people that can productionize their work. At the same time, he’s arguing that scientists should be maintaining their own data infrastructure, but that’s not good either.

The best advice I was given was to hire people either to make you smarter, or to make you stronger/faster. You hire data scientists and ML experts to make you smarter. They should be working on problems that you can’t solve today. Infrastructure, on the other hand, isn’t your product. It’s overhead. It’s a tool. Comparatively, it’s easier to hire people to build and maintain your infrastructure. Hire people to do that. All the time your scientists spend dealing with infrastructure is time they could be doing useful work.

All that said, know when you should just shove the infrapeople aside and do it yourself.

Infrastructure isn't your product. Why build an infrastructure instead of buying it?

Not OP, but given the context, it seems OP is using infrastructure to mean "all prerequisites to doing ML/data analysis work."

Some of that (e.g. data warehousing, etc.) is easier to outsource; other parts (data acquisition from your product, ETL design, etc.) are necessarily bespoke to your company and thus not readily "buyable." I understand OP to be arguing roughly "you can get a good DBA for much cheaper than you can get a good ML Engineer (much less a good ML Engineer who's ALSO a good DBA), so there's no sense in making Database management part of the Data Scientist role."

You have correctly understood what I am saying.

A very good article, but I think that there is a missing concept - which is organisational maturity. In a fully mature data driven organisation (like... errm Google I guess - reading Jeff Dean's papers anyway) there is a well developed data fabric, polished processes for providing credentials and authority, right sized resourcing pools and also substantial diversity of specialisation coupled with experience and domain insight. Specialists can flourish and deliver value out of proportion to their costs. In other, less developed, organisations there's no chance this will happen and specialists will be left floundering looking for the setting in which they can do their thang.

> A very good article, but I think that there is a missing concept - which is organisational maturity.

Maturity and also scale - I suppose a small or even one-man shop requiring a generalist could be mature. Once you get to a certain size specialization happens automatically.

I agree that a one man shop can be "mature"; but there are many very large scale operations that have cultures that absolutely preclude speciality.

Article's sentiments are also true for Business Intelligence. The most effective (I deliberately used the word effective) BI developers have the following qualities: interested in the business, able to chat to clients (emotional intelligence), and also able to code. The best BI people end up being generalists: talkative nerds who can converse with business types. And from the business end, you get the business people who are genuinely curious and willing to learn some SQL.

Being able to communicate is key in BI because this enables you to focus on the right business problems.

I agree and disagree with this post. I do think data scientists need to be better at data processing and do more of it. But I still think you do need a separation of labor between people setting up pipelines and people building models from the data. The real issue is that there are a lot of data science departments where they whittle away at their models in some notebook and then they're "done" once the notebook is showing the right metrics. Data scientists should be writing their models from the beginning so that they can productionize them once they are finished. There shouldn't be frequent hand-off events requiring lots of communication between DS, pipelines, and data engineering teams; there should be an integration process set up so the flow of work continues to function without intervention.

Interestingly, the article doesn’t talk about the scale of production and its effects on productivity. When you produce lots of pins, division of labor is a known way to increase productivity.

A data science generalist may work fine for a small data shop but as you grow and expand data science in your organization, we know the next step to increase productivity involves specialization (AKA division of labor). It happens not just in data science, but in all business functions and with all business roles.

Marketing, Sales, Finance, Engineering, Operations - every business function uses specialization to get productivity gains. So while generalists may work for you if you’re a small business or a large business spinning up a new business function, specialization is a proven economic tool for productivity gains as you grow.

Interestingly, as a business function grows, the communication costs and the ensuing delays increase and this is a known side-effect of specialization within that business function. This doesn’t mean one throws away specialization and runs to the other extreme of the spectrum with their use of generalists. There’s a tradeoff organizations make here and there’s been a lot of experimentation done in this space like - Amazon's two-pizza teams (https://zurb.com/word/two-pizza-team), Spotify’s Squads, etc - these organizational structures are not universally applicable but they’re interesting developments to look at.

Shameless Plug (on current state of data science market) - https://medium.com/open-factory/state-of-the-m-art-big-data-...

I generally agree with this article, and I am, and continue to aspire to be, a strong generalist data scientist. However, I do still enjoy/need to have 1 or 2 really really strong quants/statistician types on my team, since they are able to solve certain problems at a level of depth I can't reach. However, if they aren't supported by generalists, they also struggle to make impact.

Yes, indeed, the main issue TFA missed out on is comparative advantage.

Theoretically speaking...it's much more efficient to have the specialists doing what they do best instead of trying to learn how to optimize SQL queries or whatever.

This sounds suspiciously like the battle software developers have been waging with people who want to run software development in a manufacturing model. The battle itself really sucks the love of making something right out of you.

The author points this out at the end but I want to highlight it. Adam Smith also said that division of labor makes a person "as stupid and ignorant" as a person can become. https://www.pitt.edu/~syd/ASIND.html

I'm a Financial Analyst, CPA, CIA, CTA, Statistician, Expert System Developer.

I independently developed a financial analysis expert system, with a strong ability to innovate and execute.

All my expertise is entirely self-taught.

My Project: https://github.com/linpengcheng/fa

My technology Blog: https://github.com/linpengcheng/PurefunctionPipelineDataflow

@pirocks, Don't you think this is a case of data science generalist creating good products? What kind of psychology makes you give a downvote? Why delete your comments again!


Don't you think this is a case of data science generalist creating good products?

Wow, this is a cool "interactive whitepaper" website :)


This works in environments where infrastructure can support it. It can be downright blissful!

This article is terrible. You can’t make a case by putting a bunch of unsupported assertions into section-heading fonts and then just filling in paragraphs.

This reads like a desperate business person wrote it, who wishes that one full-stack set of drives made sense and coexisted in a single person to make that labor cheaper and more of a commodity, despite the reality that it’s simply not true.

The person who spent the time to master web service frameworks, query languages and product engineering necessarily did not also master professional level knowledge of deep learning or MCMC sampling or natural language processing.

The two types of people need to coexist and work symbiotically, but it’s just asinine wishful thinking to pretend like they are the same person, let alone to write a baseless essay full of assertions that if they aren’t the same person it somehow results in first principles economic inefficiency.

You'll have to do better than an ad hominem + "the opposite is true." Author is Chief Algorithms Officer at Stitchfix and former VP Data Science & Engineering at Netflix.

No, sorry. Argument from authority doesn’t mean the original article has a cogent point.

There’s no burden on anyone to refute anything from this piece, as the piece itself has not met any basic requirement of presenting facts or evidence in the first place.

It’s merely a matter of fact to point out this deficiency of the article. The premises of the article could still be accurate (though I think that is fleetingly unlikely), but even if so, this article does not justify any of those claims, so nobody could know one way or the other from this article. Again, this is just a matter of observation of the justifications given.

This author would personally find it more convenient if the skillset of data scientists and data platform engineers coexisted in one person who also happened to have the drive to undertake employment spanning all those skill sets, and wouldn’t become unhappy if the employer did not respect specializations. So this author has decided to read tea leaves out of economic principles and superimpose this wish as if it was justified by some first principles analysis.

In fact, this wishful thinking seems exactly in line with the flawed perspective that executives or director level employees will have. They don’t want to have to care about motivation and intellectual curiosity required to keep certain kinds of knowledge workers happy & productive, and spend lots of time trying to justify how their business units embody corporate platitudes about customer-driven passion. It’s quite easy to see why they would fall victim to this sort of naive wishful thinking. It’s quite similar to CTOs getting suckered by turn-key consulting solutions. It’s not even surprising that VPs & C-suite executives would be very wrong about this type of work.

> Author is Chief Algorithms Officer at Stitchfix and former VP Data Science & Engineering at Netflix.

This is what I was referring to when I said argument from authority :)

From https://en.wikipedia.org/wiki/Argument_from_authority : a fallacy to cite an authority on the discussed topic as the primary means of supporting an argument

You'll have to do better than an argumentum ab auctoritate (aka argument from authority)

PS: Thanks for pointing out ad hominem :)

I like Ray Dalio's principle of "believability."[1] All else being equal, it's reasonable to weight an experienced person's input more than that of a less experienced person.[2]


[2] I don't know the experience of the person I was responding to, so I'm making an assumption.

It is rare that deep learning performs better than simple analysis and statistics.

It is a lot more common that a data anomaly is caused by a bug in implementing a web framework.

It depends on what you’re working on. For generic descriptive statistics, then I agree, and also that has nothing whatsoever to do with data science.

If you’re trying to do reverse image search or machine translation or creating custom embeddings unique to your business problem at hand, then deep learning is hands down better.

This bolsters my point as well. If you only hired “full stack” data scientists and you’re trusting them to correctly tell you if / how deep learning is applicable to a new problem, instead of hiring specialists who actually know how to systematically diagnose that situation, you’re setting yourself up to fail. You may already be too biased towards believing simpler things “should” do better, and you’ll take the full stack person’s inability to outperform with deep learning as if it is confirmatory evidence, when really all it is telling you is that you need a specialist.

I somewhat agree with you. Someone who is spending time now studying jQuery and becoming proficient at developing web services would necessarily not be able to keep up with the pace of deep learning. On the other hand, there are people who managed to become relatively proficient at developing software a decade ago and spent the last decade becoming proficient at deep learning.

You need more than a decade to become proficient with deep learning at the level of researchers solving novel business problems.

It takes at least a decade just to study the prerequisite materials in vector calculus, linear algebra, advanced statistics, classifier algorithms, convex and gradient-based optimization, matrix computations and numerical methods, and associated software engineering skills. That’s all just to get to “base camp” of deep learning.

On the flip side, it’s pretty low effort to just use plug-n-play network components from popular libraries and follow a few tutorials or open source projects.

That’s why there’s effectively zero employment demand for the skill of naive keras or pytorch lego building. It’s as easy as it is meaningless.

Given that you’d already have been spending a decade+ of your life on advanced math if you planned to work on deep learning to solve real problems, there’s a huge impedance mismatch with this idea that you’d somehow also magically just be happy ignoring that specialized skill and the time investment sunk into it to then instead be happy writing throw-away little Flask apps or optimizing routine ETL queries.

My assumption was "starting from a post-graduate level in computer science, natural sciences or equivalent". By the way, I don't see how anyone could have more than a decade specifically in deep learning, considering that the field had started at around that time.

On a flip side, TensorFlow 2.0 and AutoML are coming ;). And generic RL agents that do not require reward hacking are also on the horizon. Who cares if a researcher spends 10000 hours reading articles AND 10000 hours building products, if a more general algorithm obsoletes it all ;)

> “My assumption was "starting from a post-graduate level in computer science, natural sciences or equivalent".”

Yes, same for me. This builds in nearly a decade of preparatory work into the timeline... so it seems we agree.

> “On a flip side, TensorFlow 2.0 and AutoML are coming ;). And generic RL agents that do not require reward hacking are also on the horizon.”

I work professionally in deep learning for image processing. This quote reads like parody to me. I cannot imagine anyone familiar with the realities of AutoML or deep reinforcement learning talking this way. It’s like an excerpt from the script of Silicon Valley.

Have you used AutoML in practice for DNN architecture search?

Yes, I have used AutoKeras in practice, with mixed results. I have also written in-house hyperparameter search tooling to spread parametric architecture search in a distributed training environment with about the same mixed success. I have done this for both large-scale image processing networks and natural language processing networks.

Using AutoML in practice is beyond foolish, given the pricing, except for a really small minority of customers. Let alone that neural architecture search is not a silver bullet and frequently is totally not helpful for model selection (for example, say your trade-off space involves severe penalty on runtime and you have a constraint that your deployed runtime system must be CPU-only.. you may trade performance for the sake of reducing convolutional layers, in a super ad hoc business-driven way that does not translate to any type of objective function for NAS libraries to optimize... one of the most important production systems I currently work on has exactly this type of constraint).

Interesting. I agree, it is not trivial to estimate the runtime of the model on a target device. I wonder how Google does it. They've been boasting about precisely this ability - to optimize for architecture under constraints of precision AND runtime for a target device. And then, claiming that they've been able to get an architecture better than one optimized by a team of engineers over a few years.

It’s all hype coming out of Google. Most of this stuff is meant for foisting overpriced solutions onto unwitting GCP customers who get burnt by vendor lock-in and don’t have enough in-house expertise to vet claims about e.g. overpriced TPUs or overpriced AutoML.

>You need more than a decade to become proficient with deep learning at the level of researchers solving novel business problems.

No. Even these people haven’t been doing it for a decade.

Yes. They were doing 4 years of math intensive undergrad and 5+ years of math intensive post-grad work before there even was an advent of deep learning in 2008-2012.

So they were doing deep learning as an undergrad, huh? Studying tensors for 10 years? Next you’ll be telling me they have 20 years experience coding for CUDA.

No. None of these people have the math background you think, nor do you need it.

I don’t want to belabor this point (but apparently I do, since I’m replying to my own post 2 hours later), but your idea of going back and counting undergraduate work, when it isn’t even related, is simply padding for padding's sake. Why stop there? Why not 4 years of high school, too? Aren’t they prerequisites? I mean you can go back all the way to preschool, since counting is a prerequisite to math, and hell, is a whole section of discrete math. But you don’t, because claiming you need 25 years of mathematical training sounds ludicrous.
