
Math in Data Science - Anon84
https://www.dataquest.io/blog/math-in-data-science/
======
notafraudster
"If you want to scrape the surface, a course in elementary statistics would be
fine. If you want a deep conceptual understanding, you’ll probably want to
know how the formula for residual sum of squares is derived, which you can
learn in most courses on advanced statistics."

This is an extremely shallow claim. The basics of regression rely on, let's
see, understanding the normal and t distributions (each of which is described
in papers far beyond an undergraduate statistics degree). To understand how
the rules of probability empirically map to outcomes, you need a good
understanding of the Central Limit Theorem (and thus the Squeeze Theorem,
Continuous Mapping Theorem, Chebyshev's Inequality / WLLN, etc.) all of which
are pretty sophisticated. Congrats, you now know linear regression. What about
all the different variance estimators, their properties under model
misspecification, simulation properties of violations of Gauss-Markov, robust
standard errors (including more advanced econometric specifications like
cluster-robustness, Satterthwaite approximation of degrees of freedom, etc.)?
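
To make the robust standard error point concrete, here is a minimal sketch in
plain Python. Everything in it is an assumption for illustration: the data are
simulated, the noise is made to grow with |x|, and the function name
`slope_with_standard_errors` is hypothetical. It compares the classical OLS
standard error of a slope with the HC0 (White) heteroskedasticity-robust one.

```python
import random


def slope_with_standard_errors(x, y):
    """Fit y = a + b*x by least squares; return (b, classical SE, HC0 robust SE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical formula: assumes every error has the same variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    classical_se = (sigma2 / sxx) ** 0.5
    # HC0 (White) sandwich formula: lets the error variance change with x.
    robust_se = sum((d * e) ** 2 for d, e in zip(dx, resid)) ** 0.5 / sxx
    return slope, classical_se, robust_se


# Simulated (hypothetical) data: true slope is 2, noise scales with |x|.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [1.0 + 2.0 * xi + abs(xi) * rng.gauss(0.0, 1.0) for xi in x]

slope, classical_se, robust_se = slope_with_standard_errors(x, y)
print(f"slope={slope:.3f} classical_se={classical_se:.4f} robust_se={robust_se:.4f}")
```

In this setup the classical formula should understate the uncertainty, since
it averages the error variance over all x while the slope is driven mostly by
the high-|x|, high-noise points.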

If you're willing to take a textbook's word for it, actually using most data
science stuff is as easy as spending an hour on coursera or reading some free
web pdf book. Who needs any math skill? Just close your eyes when the book
covers math. Like how a mechanic doesn't need to know any physics, the data
science user doesn't need to know any math or statistics. If the pretty stars
show up in the regression table you found the truth.

So, yes, on this level, data science requires a high school or early
undergraduate understanding of math no more sophisticated than basic
algebra, the first few weeks of matrix algebra, and the first few weeks of
Calculus 1 and 2.

But on the other hand, to actually understand, prove, and improve on existing
techniques requires the equivalent of upper division undergraduate mathematics
and a graduate education in statistics.

~~~
mlevental
there's a salty purist like you listing buzzwords from casella and berger in
every thread. your brethren in the data structures threads are those repping
CLRS or taocp. and yet data science as an industry marches on (not to say
anything of dev).

>prove, and improve on existing techniques

how many "data scientists" in industry are proving or improving existing
techniques in their day to day?

>Like how a mechanic doesn't need to know any physics

this is absolutely correct since, you know, they're able to provide value for
customers by fixing their cars

i would say that this is exactly what elitism looks like: you take a look at
someone that's successful but under different terms than you yourself see fit,
and claim it's not genuine success.

~~~
6gvONxR4sf7o
There's a huge difference. If I ask "why is this model applicable to this
situation?" or even "is this model applicable to this situation?" then I want
the practitioner to have the ability to provide and understand the answer.
Yes, data science as an industry marches on, but that's because wrong answers
are largely undetectable.

If I roll out the treatment cell of an experiment but shouldn't have, because
I didn't account for within-client correlation, the world marches on and the
business thinks they are making data driven decisions. They'll never know they
left millions on the table. Or maybe I'm right for the wrong reasons and I was
merely more confident than I should have been and the world still marches on
oblivious because there are very very few correction mechanisms.
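
A rough simulation of that failure mode, under assumptions I'm making up for
illustration: an A/A test (no real effect), whole clients randomized to an
arm, 50 observations per client, and a shared per-client effect. The function
names and parameter values are hypothetical. The naive analysis treats every
observation as independent; the other one compares per-client means.

```python
import random


def mean(values):
    return sum(values) / len(values)


def two_sample_stat(a, b):
    """Difference in means divided by its unpooled (Welch-style) standard error."""
    def var(v):
        m = mean(v)
        return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)
    return (mean(a) - mean(b)) / (var(a) / len(a) + var(b) / len(b)) ** 0.5


def simulate_client(rng, n_obs, client_sd, noise_sd):
    """Observations from one client share a common random client effect."""
    effect = rng.gauss(0.0, client_sd)
    return [effect + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]


def false_positive_rates(n_sims=200, clients_per_arm=20, n_obs=50,
                         client_sd=1.0, noise_sd=1.0, seed=0):
    """Return (naive rate, client-level rate) of 'significant' A/A results."""
    rng = random.Random(seed)
    naive_hits = 0
    cluster_hits = 0
    for _ in range(n_sims):
        arms = [[simulate_client(rng, n_obs, client_sd, noise_sd)
                 for _ in range(clients_per_arm)] for _ in range(2)]
        # Naive: pool all observations and ignore which client they came from.
        flat = [[obs for client in arm for obs in client] for arm in arms]
        if abs(two_sample_stat(flat[0], flat[1])) > 1.96:
            naive_hits += 1
        # Client-level: one mean per client, so the units are independent.
        # 2.02 approximates the 5% two-sided t cutoff for about 38 df.
        means = [[mean(client) for client in arm] for arm in arms]
        if abs(two_sample_stat(means[0], means[1])) > 2.02:
            cluster_hits += 1
    return naive_hits / n_sims, cluster_hits / n_sims


naive_rate, cluster_rate = false_positive_rates()
print(f"naive false positives: {naive_rate:.2f}, client-level: {cluster_rate:.2f}")
```

With these made-up settings the naive test should flag a "winner" far more
often than the nominal 5%, while the client-level test should stay close to
it. Nothing downstream would tell the business which analysis it got.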

~~~
chibg10
The academic math types with no ML experience don't fare much better than the
no-math practitioners with experience from what I've seen. They generally have
terrible mental models of how the math behaves when applied to real data.

IME the competent data scientist needs to have a high-level understanding of
the concepts underlying ML (stats, optimization, central limit theorem) and to
be comfortable reading math to understand the low-level details when they are
relevant. In addition to actually understanding the ML techniques they're
using obviously.

On top of that, I would rather work with someone who can code well than a pure
mathematician with a ton of experience proving theorems (and no coding
ability). Outside of a few industry research jobs, the cost:benefit ratio of
making substantial theoretical breakthroughs in most everyday ML work doesn't
justify a lot of time doing theoretical proofs.

~~~
otabdeveloper4
The person you're replying to isn't talking about "theoretical breakthroughs",
he's talking about a basic ability to interpret and detect errors in your
model.

These are abilities that "data scientists" do, indeed, lack.

------
goodmattg
This whole "how much math is required to work in ______" debate always feels
like reading Rashomon. Everyone is looking at the same situation, but no one sees
it the same way. Data Science teams range in size and sophistication across
various industries, so obviously "what it takes to be a data scientist" varies
based on the company.

Want to be a data scientist in a large org. with limited data science
sophistication? You may not need an advanced mathematics background to make an
impact; out-of-the-box models/techniques might produce actionable results if
you have access to clean data.

Try doing the same thing at sophisticated tech companies (FANG, BAT, etc.).
The datasets are massive, heterogeneous, and often unstructured. Add to that,
large teams full of former Ph.Ds and clever engineers have already given the
common problems a go, so any better solution will need to be 'more clever'.
It's difficult to be 'more clever' without advanced math.

What is hilarious about these articles is none of them ever say "Math as a
SENIOR data scientist", because we all acknowledge that senior scientists know
the math. Almost every job listing for senior data scientist will prefer a
Ph.D. On larger teams, the Ph.D (advanced math) crowd will come up with the
general approach that junior data scientists will work on. So yes, getting in
the door as a data scientist requires having enough math/programming so you
can at least understand what a senior scientist wants you to do. But if you
want to lead a team in a technical capacity, you can't possibly imagine doing
that without really knowing the math or having a brilliant track record.

And if you are the rockstar 10x guru Data Scientist who consistently delivers
actionable insights without a strong math background, you deserve the respect
of everyone here and will probably be my boss someday.

~~~
denzil_correa
> What is hilarious about these articles is none of them ever say "Math as a
> SENIOR data scientist", because we all acknowledge that senior scientists
> know the math. Almost every job listing for senior data scientist will
> prefer a Ph.D. On larger teams, the Ph.D (advanced math) crowd will come up
> with the general approach that junior data scientists will work on. So yes,
> getting in the door as a data scientist requires having enough
> math/programming so you can at least understand what a senior scientist
> wants you to do. But if you want to lead a team in a technical capacity, you
> can't possibly imagine doing that without really knowing the math or having
> a brilliant track record.

In other words, being a Data Scientist requires performing "scientific"
activities. Most of these articles focus on the "Data" part and very rarely
on the "science". Science requires excellent scholarship - synthesis of the
current situation and putting future proposals in the context of the current
"jigsaw puzzle". Now, that is a skill that you cannot learn by reading - you
need to practice it for some period of time. No wonder a Ph.D. is a
requirement for such positions.

------
ktpsns
> If you’ve completed a math degree or some other degree that provides an
> emphasis on quantitative skills, you’re probably wondering if everything you
> learned to get your degree was necessary.

It should be emphasized that obtaining a degree in a subject like math and
physics should "form" your way of thinking in a certain way: That you are able
to apply methods to other fields, such as data science. That's way more
important than remembering "everything you learned". You might never do a
Fourier transformation by hand again, but you know the concept. And you can
apply it.

~~~
uoaei
I've met plenty of bootcamp graduates who "know the math" but still can't move
from rote recitation to intuitive relation. You can ask for the form of the
loss function for "classification" (i.e., the usual cross-entropy) and get it
every time. But it is apparent that without a lot of time (which becomes
significantly shorter with prior physics/compsci/STEM experience) they can't
look at the various tools as building blocks and assemble them into novel
machine learning systems. They only understand inductive bias as it pertains
to answering trivia about the differences between CNNs and RNNs. Disclaimer to
say this is an anecdote but one that is reflected across a number of my
colleagues, most of whom I trust not to fall prey to numerous cognitive
biases.
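
For reference, the "usual cross-entropy" for a single example is just the
negative log of the softmax probability assigned to the true class. A minimal
sketch, where the function name and the example logits are my own
illustration rather than anything from a particular library:

```python
import math


def cross_entropy(logits, true_class):
    """Negative log of the softmax probability given to the true class."""
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[true_class]


# Hypothetical logits for a 3-class problem; class 0 is the true label.
print(cross_entropy([2.0, 0.5, -1.0], 0))
```

Being able to write this down is the recitation part. Knowing when to swap
it for something else is the part that takes time.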

------
b_tterc_p
Hmm decent math proficiency is necessary to understand data science, but I
don’t think it’s much of a differentiator.

Communication and critical thinking, not unlike many jobs, are the more
important things. IMO many data scientists suck at the former.

~~~
willis936
Being able to do your job is always the number 1 requirement of doing your
job. Being able to communicate and think critically are nice-to-haves. A data
scientist who doesn't know data science is useless.

------
windsignaling
I think this article was written for SEO.

Calculus, linear algebra, and probability would cover pretty much all of the
algorithms mentioned in the article.

------
sorokod
That is to data science what McDonald's is to food.

~~~
notafraudster
Your comment was downvoted presumably because it is flip, but I think there's
a kernel of truth here. The blog post correctly highlights that limited math
is needed to "do" data science, but this is why we have so many irresponsible
data scientists.

I'm currently teaching a data science class intended for mid-career
professionals. It requires no college level math. And on virtually every
topic, I feel like I am doing them an injustice by not being able to delve into the
mathematics behind what we're doing. They'll be able to put on their CVs that
they "do data science", but I don't trust that my instruction gets them to the
level that they really understand much about it.

