This is an extremely shallow claim. The basics of regression rely on, let's see, understanding the normal and t distributions (each of which is described in papers far beyond an undergraduate statistics degree). To understand how the rules of probability empirically map to outcomes, you need a good understanding of the Central Limit Theorem (and thus the Squeeze Theorem, Continuous Mapping Theorem, Chebyshev's Inequality / WLLN, etc.), all of which are pretty sophisticated. Congrats, you now know linear regression. What about all the different variance estimators, their properties under model misspecification, simulation properties of violations of Gauss-Markov, and robust standard errors (including more advanced econometric specifications like cluster-robustness, Satterthwaite approximation of degrees of freedom, etc.)?
If you're willing to take a textbook's word for it, actually using most data science stuff is as easy as spending an hour on Coursera or reading some free web PDF book. Who needs any math skill? Just close your eyes when the book covers math. Like how a mechanic doesn't need to know any physics, the data science user doesn't need to know any math or statistics. If the pretty stars show up in the regression table, you found the truth.
So, yes, on this level, data science requires a high school or early undergraduate understanding of math: nothing more sophisticated than basic algebra, the first few weeks of matrix algebra, and the first few weeks of Calculus 1 and 2.
But on the other hand, to actually understand, prove, and improve on existing techniques requires the equivalent of upper division undergraduate mathematics and a graduate education in statistics.
Every stats program I know of teaches each of these concepts quite rigorously, and I can't imagine a stats program without them.
I agree with your broader claim that to actually improve on existing techniques requires graduate education, but understanding existing techniques is well within the reach of most college students.
I don't think most undergraduate stats programs deal with the dart-board derivation of the normal distribution, or read Gosset's original t paper, and moment generating functions are rarely discussed.
Here's an example of a fairly elementary question I don't think a stats undergrad would learn without self-guided study: You run a linear regression. One or more of your outcomes or regressors has a skewed distribution. Standard errors are often motivated under CLT claims of normal convergence. Suppose convergence is poor in your finite sample. What are the consequences of this for your estimates and standard errors in terms of nominal coverage, bias, and consistency?
It's possible an undergraduate class would cover Gauss-Markov and mention violations of normality, and maybe get into mitigations via transforming parameters... but to actually deep dive into the skills necessary to self-verify the kinds of consequences observed in the data you have? Nah.
I mention this because it's a question that first came up to me in a graduate regression class and I was not well prepared for it, and now when I teach statistics undergrads I don't think they would be.
Never mind any kind of analytic result, the notion of really digging in and running the kind of MC simulation required to get empirical results is typically a fourth year undergraduate topic.
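For what it's worth, here is a minimal sketch of the kind of MC simulation I mean, assuming NumPy. Everything specific in it is my own illustrative choice, not a canonical setup: the function name `slope_ci_coverage`, the lognormal regressor, the centered-exponential errors, the sample size, and the use of the normal critical value 1.96. It repeatedly fits a simple regression on small skewed samples and counts how often the nominal 95% interval for the slope contains the true value.

```python
import numpy as np

def slope_ci_coverage(n=15, n_sims=5000, beta=1.0, seed=0):
    """Fraction of simulated nominal-95% CIs for the OLS slope that contain beta.

    Illustrative assumptions: skewed (lognormal) regressor, skewed mean-zero
    (centered exponential) errors, and a normal critical value of 1.96.
    """
    rng = np.random.default_rng(seed)
    z = 1.96  # normal critical value, as a CLT argument would suggest
    hits = 0
    for _ in range(n_sims):
        x = rng.lognormal(size=n)               # skewed regressor
        eps = rng.exponential(size=n) - 1.0     # skewed errors with mean zero
        y = beta * x + eps

        X = np.column_stack([np.ones(n), x])    # intercept + slope design matrix
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)

        resid = y - X @ coef
        sigma2 = resid @ resid / (n - 2)        # classical error-variance estimate
        cov = sigma2 * np.linalg.inv(X.T @ X)   # classical OLS covariance matrix
        se_slope = np.sqrt(cov[1, 1])

        hits += int(abs(coef[1] - beta) <= z * se_slope)
    return hits / n_sims

if __name__ == "__main__":
    for n in (15, 50, 500):
        print(n, slope_ci_coverage(n=n))
```

Comparing the printed coverage against 0.95 across sample sizes is the "empirical result" part; swapping in your own data-generating process is where the real work is.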
>prove, and improve on existing techniques
how many "data scientists" in industry are proving or improving existing techniques in their day to day?
>Like how a mechanic doesn't need to know any physics
this is absolutely correct since, you know, they're able to provide value for customers by fixing their cars
i would say that this is exactly what elitism looks like: you take a look at someone that's successful but under different terms than you yourself see fit, and claim it's not genuine success.
That cracked me up lol. I was hired as a data scientist at big tech with only some high level data science skills. I first learned to write production quality code. I then started (and am still) working through Casella Berger, I'm 1/3 through. It's hard! I've had to stop and brush up on math a lot.
I love learning theoretical stats, I feel it makes me a better data scientist. But I also know I've worked with PhD data scientists who worked through all of C&B but couldn't code, and they weren't as valuable as the more self-taught code-first data scientists (at least on my team).
>I'm up on calculus and linear algebra, but I only took one statistics course as an undergrad 25 years ago.
it is a serious graduate math stats book but i personally don't consider it a graduate math book (because it does not use measure theory). it is proof focused because it's used for foundational grad stats classes (i.e. prep for quals) but the proofs aren't that hard to follow (if you've had a good calculus class [epsilon-delta definition of limits]).
some of the exercises are brutal algebraically though. good thing there's a solutions manual floating around :)
This is definitely true, and I wouldn't recommend this book to anyone unless they considered this a feature. For me, it was/is a feature, because I suck at algebra, and this is helpful.
If I roll out the treatment cell of an experiment but shouldn't have, because I didn't account for within-client correlation, the world marches on and the business thinks they are making data driven decisions. They'll never know they left millions on the table. Or maybe I'm right for the wrong reasons and I was merely more confident than I should have been and the world still marches on oblivious because there are very very few correction mechanisms.
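To make the within-client correlation point concrete, here is a rough sketch, assuming NumPy and a made-up setup: treatment is assigned per client, each client contributes many observations that share a client-level effect, and there is no true treatment effect. The function name `false_positive_rates` and all parameter values are hypothetical. It compares a naive test that treats every observation as independent against a test on one mean per client, which is a simple stand-in for cluster-robust standard errors, not the same estimator.

```python
import numpy as np

def false_positive_rates(n_clients=40, obs_per_client=25, n_sims=2000, seed=0):
    """Return (naive, client_level) rejection rates when the true effect is zero.

    Illustrative assumptions: half the clients are treated, observations within
    a client share a N(0, 1) client effect plus N(0, 1) noise, and both tests
    use the normal critical value 1.96.
    """
    rng = np.random.default_rng(seed)
    z = 1.96
    half = n_clients // 2
    naive_rejects = 0
    client_rejects = 0
    for _ in range(n_sims):
        client_effect = rng.normal(size=(n_clients, 1))        # shared within client
        noise = rng.normal(size=(n_clients, obs_per_client))
        y = client_effect + noise                               # no treatment effect
        treat, ctrl = y[:half], y[half:]

        # Naive: pretend all observations are independent draws.
        a, b = treat.ravel(), ctrl.ravel()
        se_naive = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
        naive_rejects += int(abs(a.mean() - b.mean()) > z * se_naive)

        # Client-level: collapse to one mean per client, then compare groups.
        am, bm = treat.mean(axis=1), ctrl.mean(axis=1)
        se_client = np.sqrt(am.var(ddof=1) / am.size + bm.var(ddof=1) / bm.size)
        client_rejects += int(abs(am.mean() - bm.mean()) > z * se_client)
    return naive_rejects / n_sims, client_rejects / n_sims

if __name__ == "__main__":
    naive, client = false_positive_rates()
    print(f"naive: {naive:.3f}  client-level: {client:.3f}")
```

In this setup the naive test "finds" an effect far more often than the nominal 5%, which is exactly the overconfidence nobody downstream would ever notice.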
IME the competent data scientist needs to have a high-level understanding of the concepts underlying ML (stats, optimization, central limit theorem) and to be comfortable reading math to understand the low-level details when they are relevant. In addition to actually understanding the ML techniques they're using obviously.
On top of that, I would rather work with someone who can code well than a pure mathematician with a ton of experience proving theorems (and no coding ability). Outside of a few industry research jobs, the cost:benefit ratio of making substantial theoretical breakthroughs in most everyday ML work doesn't justify a lot of time doing theoretical proofs.
These are abilities that "data scientists" do, indeed, lack.
Want to be a data scientist in a large org. with limited data science sophistication? You may not need an advanced mathematics background to make an impact; out-of-the-box models/techniques might produce actionable results if you have access to clean data.
Try doing the same thing at sophisticated tech companies (FANG, BAT, etc.). The datasets are massive, heterogeneous, and often unstructured. Add to that, large teams full of former Ph.Ds and clever engineers have already given the common problems a go, so any better solution will need to be 'more clever'. It's difficult to be 'more clever' without advanced math.
What is hilarious about these articles is none of them ever say "Math as a SENIOR data scientist", because we all acknowledge that senior scientists know the math. Almost every job listing for senior data scientist will prefer a Ph.D. On larger teams, the Ph.D (advanced math) crowd will come up with the general approach that junior data scientists will work on. So yes, getting in the door as a data scientist requires having enough math/programming so you can at least understand what a senior scientist wants you to do. But if you want to lead a team in a technical capacity, you can't possibly imagine doing that without really knowing the math or having a brilliant track record.
And if you are the rockstar 10x guru Data Scientist who consistently delivers actionable insights without a strong math background, you deserve the respect of everyone here and will probably be my boss someday.
In other words, being a Data Scientist requires performing "scientific" activities. Most of these articles focus on the "Data" part and very rarely on the "science". Science requires excellent scholarship: synthesizing the current situation and putting future proposals in the context of the current "jigsaw puzzle". Now, that is a skill that you cannot learn by reading; you need to practice it for some period of time. No wonder a Ph.D is a requirement for such positions.
It should be emphasized that obtaining a degree in a subject like math and physics should "form" your way of thinking in a certain way: you become able to apply methods to other fields, such as data science. That's way more important than remembering "everything you learned". You might never do a Fourier transformation by hand again, but you know the concept. And you can apply it.
Communication and critical thinking, not unlike many jobs, are the more important things. IMO many data scientists suck at the former.
Calculus, linear algebra, and probability would cover pretty much all of the algorithms mentioned in the article.
I'm currently teaching a data science class intended for mid-career professionals. It requires no college level math. And for virtually every topic, I feel like I am doing them an injustice by not being able to delve into the mathematics behind what we're doing. They'll be able to put on their CVs that they "do data science", but I don't trust that my instruction gets them to the level where they really understand much about it.