That's perfectly fine but it's not what traditionally is referred to as data science. I'm actually quite annoyed at what has been happening to the term data science lately - it's supposed to be some stats-heavy/applied-AI role but a lot of companies hiring "data scientists" are really just hiring SQL jockeys.
Personally I've done both data science and data infrastructure and I like infrastructure a lot more anyway. And it sounds like you are somewhat qualified for that with some of your pipeline work (although big data experience is also important). A LOT of data science departments have no idea what type of business value they are supposed to be adding, are doing shitty boring work with glorified titles, or are improperly integrated with the company at large (bad productionizing processes, poor data infrastructure). There's always going to be a need for data infrastructure but the "data science" hype is going to fade once all the shitty data departments cut the fat.
When I interview people who have your type of background, I tend to get confused about what exactly the person wants to do (Analyze Data? Build an Analytics Pipeline/Architecture? Write Software/Services? Be an Analytics IT person?). Sometimes they try to talk about all the "cool" stuff they've done and lose focus on what they bring to the role. I also become skeptical because it's really easy nowadays to follow a few tutorials on 50 different things and then boast about how you did it all yourself.
Even reading your comment, you don't sound like somebody who wants to analyze data.
As far as being a generalist, I definitely agree that it's good to have somebody with skills in ETL, analyzing data, and maybe building a software service. But what happens is that all those things happen at different speeds and then people get crushed. You're asked to investigate some data quickly over 3 days, but suddenly the software service you built is having issues and you need 2 weeks to dig in and fix it; meanwhile your ETL job is overloading the server and needed fixing yesterday, but you need help from somebody else to figure it out. Oh, and that Metabase thing you installed is broken, and the VP was using it and has a big demo tomorrow.
But this is the crux of the job-seeker's dilemma. If he/she is specific about their interests when speaking to an interviewer, they might get a response like "well, we're really looking for someone whose operational focus is [something else]".
And if they're not super-specific (I doubt anyone does data analysis exclusively without any other involvement in the project), but instead attempt to give examples where they had demonstrable impact working across a number of domains, you might hear a response like this:
> Even reading your comment, you don't sound like somebody who wants to analyze data.
But if you have a resume (or say this during an interview) that gives equal weight to the data analysis and the stack deployment, it's just confusing to the person reading it. Especially in Data Science, which is already confusing from a skillset perspective. Lots of resumes look like the applicants just thought 10 things with minimal overlap were cool and decided to put them on their resume.
Even if you did work at a 5 person startup and had the unofficial title of "Data Scientist, Data Engineer, Data DevOps, DB Admin, and Chief Data Officer", I'd recommend you downplay some of those based on the jobs you are applying for. Figure out what is essential and what is a +1.
But you realize that the vast majority of businesses in North America need someone to solve all of those problems. They aren't going to hire, and can't afford, an experienced data team of specialists.
This is the point of the article. Getting the 80% is far more valuable than having some PhD optimizing the hell out of features. Silicon Valley tends to overthink things.
The vast majority of businesses don't need Data Scientists. They need a BI person with SQL skills and some of the skills of a Database Admin. What most companies really need is a good set of Dashboards and clean data to feed them. This enables the business people to get the information/visibility they need and make decisions.
Also, most businesses should not be building analytic services and deploying them - they should be paying for a good product with a cloud or easy on-prem install and getting support from the company that sells the product. A few licenses of a good BI product are a lot cheaper than a Data Scientist.
When I am up and running I don't want yet another generalist - or rather I will happily take one, I just will put them in a box making pins.
Perhaps the GP will do better at the consulting level - or even some kind of productised consulting, or an out-of-the-box product.
The trouble you might be having in getting an interview is probably partly to do with your background, and likely also that those job postings get A LOT of submissions. Other hiring managers in my department, as well as myself, have found we get 10x more submissions for DS/ML positions than software dev positions. In general it's a really unrefined and new job market, and anyone and everyone who's taken a Coursera course in regression or clustering will apply.
When I'm hiring I care most about wins. These are wins. When I read 30M events / month I want to hear more.
The rest is fluff IMO and things that I'd expect you to play around with while you're self learning. Also, most hiring managers don't care about what you did 10 years ago in sales if you're applying for a data science role. It might be icing on the cake you can share if you get into a conversation about sales or marketing, but otherwise it can feel off topic.
I'd slim down your resume to focus on these two wins (plus any recent experience building or leading teams) and stay laser-focused on recent data science related work in production. That sounds like a good enough resume to get an interview at most companies. Good luck!
Losses can be just as important, IME. “I tried to do X, tried several methods, each one failed due to...” would be an interesting conversation to have with an interviewee.
Check out the series by Jeff Leek, Brian Caffo and Roger D. Peng, “A Crash Course in Data Science.”
Hope this helps, I can be pretty clueless sometimes so you probably already know all the mathy bits.
This certainly doesn't excuse hiring managers from lazily filtering out candidates who don't have these things. But these are strong signals.
The skills you have are useful. I know we wouldn't hire you in for a data science role on my team. I could imagine many other places they're useful, though, so perhaps it's the places you're searching.
You can also train entry level for a good amount less too.
Email me: mark at dotscience dot com
If you don't want to be banned, you're welcome to email email@example.com and give us reason to believe that you'll follow the rules in the future.
It’s not, though. I know a data scientist who studied Arabic language. And software engineers who studied music.
Your comment reads more like sour grapes about your own college experience than anything else ;)
The author kind of builds a strawman of super-specialized data scientists who constantly throw code over the wall to someone else. That doesn't work, and you simply can't do that unless your headcount is in the thousands. You have to have people who can productionize their work. At the same time, he's arguing that scientists should be maintaining their own data infrastructure, and that's not good either.
The best advice I was given was to hire people either to make you smarter, or to make you stronger/faster. You hire data scientists and ML experts to make you smarter. They should be working on problems that you can’t solve today. Infrastructure on the other hand, isn’t your product. It’s overhead. It’s a tool. Comparatively, it’s easier to hire people to build and maintain your infrastructure. Hire people to do that. All the time your scientists are dealing with infrastructure, is time they could be doing useful work.
All that said, know when you should just shove the infrapeople aside and do it yourself.
Some of that (e.g. data warehousing, etc.) is easier to outsource; other parts (data acquisition from your product, ETL design, etc.) are necessarily bespoke to your company and thus not readily "buyable." I understand OP to be arguing roughly "you can get a good DBA for much cheaper than you can get a good ML Engineer (much less a good ML Engineer who's ALSO a good DBA), so there's no sense in making database management part of the Data Scientist role."
Maturity and also scale - I suppose a small or even one-man shop requiring a generalist could be mature. Once you get to a certain size specialization happens automatically.
Being able to communicate is key in BI because this enables you to focus on the right business problems.
A data science generalist may work fine for a small data shop but as you grow and expand data science in your organization, we know the next step to increase productivity involves specialization (AKA division of labor). It happens not just in data science, but in all business functions and with all business roles.
Marketing, Sales, Finance, Engineering, Operations - every business function uses specialization to get productivity gains. So while generalists may work for you if you’re a small business or a large business spinning up a new business function, specialization is a proven economic tool for productivity gains as you grow.
Interestingly, as a business function grows, the communication costs and the ensuing delays increase and this is a known side-effect of specialization within that business function. This doesn’t mean one throws away specialization and runs to the other extreme of the spectrum with their use of generalists. There’s a tradeoff organizations make here and there’s been a lot of experimentation done in this space like - Amazon's two-pizza teams (https://zurb.com/word/two-pizza-team), Spotify’s Squads, etc - these organizational structures are not universally applicable but they’re interesting developments to look at.
Shameless Plug (on current state of data science market) - https://medium.com/open-factory/state-of-the-m-art-big-data-...
Theoretically speaking...it's much more efficient to have the specialists doing what they do best instead of trying to learn how to optimize SQL queries or whatever.
I independently developed a financial analysis expert system, with a strong ability to innovate and execute.
All my expertise is entirely self-taught.
My technology Blog:
This reads like it was written by a desperate business person who wishes that one full-stack set of skills and drives made sense and coexisted in a single person, to make that labor cheaper and more of a commodity, despite the reality that it's simply not true.
The person who spent the time to master web service frameworks, query languages and product engineering necessarily did not also master professional level knowledge of deep learning or MCMC sampling or natural language processing.
The two types of people need to coexist and work symbiotically, but it’s just asinine wishful thinking to pretend like they are the same person, let alone to write a baseless essay full of assertions that if they aren’t the same person it somehow results in first principles economic inefficiency.
There’s no burden on anyone to refute anything from this piece, as the piece itself has not met any basic requirement of presenting facts or evidence in the first place.
It’s merely a matter of fact to point out this deficiency of the article. The premises of the article could still be accurate (though I think that is vanishingly unlikely), but even if so, this article does not justify any of those claims, so nobody could know one way or the other from this article. Again, this is just a matter of observation of the justifications given.
This author would personally find it more convenient if the skillset of data scientists and data platform engineers coexisted in one person who also happened to have the drive to undertake employment spanning all those skill sets, and wouldn’t become unhappy if the employer did not respect specializations. So this author has decided to read tea leaves out of economic principles and superimpose this wish as if it was justified by some first principles analysis.
In fact, this wishful thinking seems exactly in line with the flawed perspective that executives or director level employees will have. They don’t want to have to care about motivation and intellectual curiosity required to keep certain kinds of knowledge workers happy & productive, and spend lots of time trying to justify how their business units embody corporate platitudes about customer-driven passion. It’s quite easy to see why they would fall victim to this sort of naive wishful thinking. It’s quite similar to CTOs getting suckered by turn-key consulting solutions. It’s not even surprising that VPs & C-suite executives would be very wrong about this type of work.
This is what I was referring to when I said argument from authority :)
From https://en.wikipedia.org/wiki/Argument_from_authority : a fallacy to cite an authority on the discussed topic as the primary means of supporting an argument
PS: Thanks for pointing out ad hominem :)
 I don't know the experience of the person I was responding to, so I'm making an assumption.
It is a lot more common for a data anomaly to be caused by a bug in a web framework implementation.
If you’re trying to do reverse image search or machine translation or creating custom embeddings unique to your business problem at hand, then deep learning is hands down better.
This bolsters my point as well. If you only hired “full stack” data scientists and you’re trusting them to correctly tell you if / how deep learning is applicable to a new problem, instead of hiring specialists who actually know how to systematically diagnose that situation, you’re setting yourself up to fail. You may already be too biased towards believing simpler things “should” do better, and you’ll take the full stack person’s inability to outperform with deep learning as if it is confirmatory evidence, when really all it is telling you is that you need a specialist.
It takes at least a decade just to study the prerequisite materials in vector calculus, linear algebra, advanced statistics, classifier algorithms, convex and gradient-based optimization, matrix computations and numerical methods, and associated software engineering skills. That’s all just to get to “base camp” of deep learning.
On the flip side, it’s pretty low effort to just use plug-n-play network components from popular libraries and follow a few tutorials or open source projects.
That’s why there’s effectively zero employment demand for the skill of naive keras or pytorch lego building. It’s as easy as it is meaningless.
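To make the "lego building" concrete, here is a minimal sketch (PyTorch chosen purely for illustration; the layer sizes are arbitrary) of snapping together off-the-shelf components from the library, which takes minutes and is exactly why it carries so little hiring signal on its own:

```python
import torch
import torch.nn as nn

# Stock components wired together with no design thought at all:
# a tiny image classifier assembled entirely from library layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 32x32x3 -> 32x32x16
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32x16 -> 16x16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # 4096 -> 10 class logits
)

out = model(torch.randn(1, 3, 32, 32))  # one fake 32x32 RGB image
print(out.shape)  # torch.Size([1, 10])
```

Anyone can produce this from a tutorial in an afternoon; knowing *why* each choice is wrong for a given problem is the decade-long part.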
Given that you’d already have been spending a decade+ of your life on advanced math if you planned to work on deep learning to solve real problems, there’s a huge impedance mismatch with this idea that you’d somehow also magically just be happy ignoring that specialized skill and the time investment sunk into it to then instead be happy writing throw-away little Flask apps or optimizing routine ETL queries.
On a flip side, TensorFlow 2.0 and AutoML are coming ;). And generic RL agents that do not require reward hacking are also on the horizon. Who cares if a researcher spends 10000 hours reading articles AND 10000 hours building products, if a more general algorithm obsoletes it all ;)
Yes, same for me. This builds in nearly a decade of preparatory work into the timeline... so it seems we agree.
> “On a flip side, TensorFlow 2.0 and AutoML are coming ;). And generic RL agents that do not require reward hacking are also on the horizon.”
I work professionally in deep learning for image processing. This quote reads like parody to me. I cannot imagine anyone familiar with the realities of AutoML or deep reinforcement learning talking this way. It’s like an excerpt from the script of Silicon Valley.
Using AutoML in practice is beyond foolish given the pricing, except for a really small minority of customers. And neural architecture search is not a silver bullet; it's frequently not helpful at all for model selection. For example, say your trade-off space involves a severe penalty on runtime, and you have a constraint that your deployed system must be CPU-only: you may trade performance for fewer convolutional layers, in a super ad hoc, business-driven way that doesn't translate to any type of objective function for NAS libraries to optimize. One of the most important production systems I currently work on has exactly this type of constraint.
No. Even these people haven’t been doing it for a decade.
No. None of these people have the math background you think, nor do you need it.