To be blunt, the market is saturated with people who call themselves “data scientists” but are actually just reasonably skilled software engineers with at best a college sophomore level understanding of math/stats.
On the other hand, the market is nowhere near saturation for people with both advanced software engineering and math/stats skills (i.e. PhD-level).
I don't think that is correct. I don't think the market actually needs PhD-level stats skills, just as it doesn't need PhD-level pure math or PhD-level CS. The number of situations where this specialized knowledge is useful in industry is vanishingly small. Having a PhD is more about being considered for an interview for such a position at all. The actual day-to-day knowledge used will consist of topics that even a person with a bachelor's degree could acquire without putting in years.
Thanks for making my point more clearly than I did!
As an aside, regarding whether people actually do PhD-level math at these jobs, the answer is indeed often no. However, the jobs can still require PhD-level experience. This is because, for the average person, there is a lag between being able to merely learn concepts at a given level and being able to actually synthesize those concepts to solve novel problems. It is relatively easy to learn a subject and solve exercises that you know pertain to the concepts you just studied, as you would encounter in a course. It is much harder to be given a problem out of the blue and realize which concepts are required to solve it, as you would encounter in a scientific career.
As a concrete example, I was once explaining neural networks to a bright college freshman. I showed him the forward pass equation, then asked him how he would optimize the network weights given said equation. Even though he had learned the chain rule in his courses, he didn’t think to apply it to derive the backpropagation step. By contrast, a talented junior or senior can easily figure this out.
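To make the chain-rule step concrete, here is a minimal sketch. The comment above doesn't say which network was used, so everything here is an assumption for illustration: a toy two-weight network `y = w2 * tanh(w1 * x)`, a squared-error loss, and the function names `forward`, `loss`, `backprop`, and `numeric_grad`. The hand-derived gradients are checked against finite differences.

```python
import math

def forward(w1, w2, x):
    """Forward pass of an assumed toy network: h = tanh(w1 * x), y = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, t):
    """Squared-error loss L = 0.5 * (y - t)^2 for target t (an assumed choice)."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backprop(w1, w2, x, t):
    """Gradients of L w.r.t. w1 and w2, obtained by applying the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - t                        # dL/dy
    dL_dw2 = dL_dy * h                   # dy/dw2 = h
    dL_dh = dL_dy * w2                   # dy/dh = w2
    dL_dw1 = dL_dh * (1.0 - h * h) * x   # dh/dw1 = (1 - tanh^2(w1 * x)) * x
    return dL_dw1, dL_dw2

def numeric_grad(w1, w2, x, t, eps=1e-6):
    """Central finite differences, used only to sanity-check backprop()."""
    g1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
    return g1, g2
```

Calling `backprop(0.5, -1.5, 2.0, 1.0)` and `numeric_grad(0.5, -1.5, 2.0, 1.0)` should give matching values up to finite-difference error. Every line in `backprop` is just one factor of the chain rule, which is exactly the step the freshman didn't think to apply.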
In my experience, for the average person, the learning/synthesis gap is usually a few years. Hence, your average new PhD-level data scientist would be capable of synthesizing advanced undergraduate material towards solving novel problems in their job. And there are a hell of a lot of data science jobs that require that.
I'm a data science hiring manager. Besides the raw academic credential (a PhD, which typically wouldn't be in "Data Science" anyway, i.e., even the PhDs are just Data-Science-adjacent), the other things I look for are publications, conference appearances/posters, generated data products/pipelines, and contributions to relevant software. (For me, in that approximate order.)
I'd work on building some of those, because you do need to stand out against the field @MonteCarloHall described. (Software engineer + undergrad math/stats)
Those kinds of achievements would satisfy screening filters. Then of course you'd have to have the knowledge to back them up. I think it's reasonable to say that this will typically be domain-specific, e.g. you would end up with a different background knowledge base for NLP than for spatiotemporal problems or for network/graph problem domains. With all the growth has come specialization.
Need to show > publications, conference appearances/posters, generated data products/pipelines, and contributions to relevant software.
These appear to be in conflict. If companies need people with such skills, they can't just hope to get the elite few who present at conferences; they need to be hiring among the audience members as well. If they are in fact just hiring the speakers, then the jobs aren't really "bountiful".
> are publications AND/OR conference appearances/posters AND/OR generated data products/pipelines AND/OR contributions to relevant software. (For me, in that approximate order.)
^--- Taps sign
The above does not imply that I'd only consider conference speakers, right? (Although, there are a lot of conferences!) For some roles, contributions to relevant software would be the right standard.
I consolidated for brevity's sake; how could you not realize that? You are still talking about only considering the 1-3% (conference speakers, article publishers, contributors to popular repos, pipeline generators), not the other 97-99% (audience members, article readers, users of those repos, pipeline consumers).