Math in Data Science (dataquest.io)
153 points by Anon84 72 days ago | 21 comments



"If you want to scrape the surface, a course in elementary statistics would be fine. If you want a deep conceptual understanding, you’ll probably want to know how the formula for residual sum of squares in derived, which you can learn in most courses on advanced statistics."

This is an extremely shallow claim. The basics of regression rely on, let's see, understanding the normal and t distributions (each of which is described in papers far beyond an undergraduate statistics degree). To understand how the rules of probability empirically map to outcomes, you need a good understanding of the Central Limit Theorem (and thus the Squeeze Theorem, Continuous Mapping Theorem, Chebyshev's Inequality / WLLN, etc.), all of which are pretty sophisticated. Congrats, you now know linear regression. What about all the different variance estimators, their properties under model misspecification, simulation properties of violations of Gauss-Markov, robust standard errors (including more advanced econometric specifications like cluster-robustness, Satterthwaite approximation of degrees of freedom, etc.)?
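
To make one of those items concrete, here is a minimal sketch (my own toy data and statsmodels, purely illustrative, not anything from the article or the comment) of classical vs. heteroskedasticity-robust standard errors for the same OLS fit:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    x = rng.uniform(0, 10, n)
    # error variance grows with x, so the classical (homoskedastic) SE is misleading
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x, n)

    X = sm.add_constant(x)
    classical = sm.OLS(y, X).fit()              # assumes constant error variance
    robust = sm.OLS(y, X).fit(cov_type="HC1")   # Eicker-Huber-White "sandwich" SEs

    print("classical SE for the slope:", classical.bse[1])
    print("HC1 robust SE for the slope:", robust.bse[1])

Knowing when the two disagree, and which one to trust, is exactly the kind of thing the quoted claim glosses over.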

If you're willing to take a textbook's word for it, actually using most data science stuff is as easy as spending an hour on coursera or reading some free web pdf book. Who needs any math skill? Just close your eyes when the book covers math. Like how a mechanic doesn't need to know any physics, the data science user doesn't need to know any math or statistics. If the pretty stars show up in the regression table, you found the truth.

So, yes, on this level, data science requires nothing more sophisticated than high school or early undergraduate math: basic algebra, the first few weeks of matrix algebra, and the first few weeks of Calculus 1 and 2.

But on the other hand, to actually understand, prove, and improve on existing techniques requires the equivalent of upper division undergraduate mathematics and a graduate education in statistics.


> The basics of regression rely on, let's see, understanding the normal and t distributions (each of which is described in papers far beyond an undergraduate statistics degree).

Every stats program I know of teaches each of these concepts quite rigorously, and I can't imagine a stats program without them.

I agree with your broader claim that to actually improve on existing techniques requires graduate education, but understanding existing techniques is well within the reach of most college students.


I missed your reply at the time. Of course undergrad stats programs introduce the normal/t distributions, but often either in a matter-of-fact way (the normal PDF is ...) or in a sort of intuitive way (think of t as a normal distribution, but punishing ourselves for the additional uncertainty of needing to estimate...), and rarely in a beginning-to-end way. I'm not critiquing this as pedagogy, although I think a lot of students get bombarded with poorly motivated distributions that aren't well connected to a broader problem class. But that being said, I think we teach students how to characterize distributions rather than how to prove or derive them from first principles.

I don't think most undergraduate stats programs deal with the dart-board derivation of the normal distribution, or read Gosset's original t paper, and moment generating functions are rarely discussed.

Here's an example of a fairly elementary question I don't think a stats undergrad would learn without self-guided study: You run a linear regression. One or more of your outcomes or regressors has a skewed distribution. Standard errors are often motivated under CLT claims of normal convergence. Suppose convergence is poor in your finite sample. What are the consequences of this for your estimates and standard errors in terms of nominal coverage, bias, and consistency?

It's possible an undergraduate class would cover Gauss-Markov and mention violations of normality, and maybe get into mitigations via transforming parameters... but to actually deep dive into the skills necessary to self-verify the kinds of consequences observed in the data you have? Nah.

I mention this because it's a question that first came up to me in a graduate regression class and I was not well prepared for it, and now when I teach statistics undergrads I don't think they would be.

Never mind any kind of analytic result, the notion of really digging in and running the kind of MC simulation required to get empirical results is typically a fourth year undergraduate topic.
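
For anyone curious what that kind of simulation looks like, here is a rough sketch (my own toy setup with an arbitrarily skewed error distribution and a deliberately small sample, not anything from the thread): check how often the nominal 95% confidence interval for an OLS slope actually covers the true value when the errors are far from normal.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n, reps, beta = 20, 5000, 2.0   # small sample, many Monte Carlo replications
    covered = 0
    for _ in range(reps):
        x = rng.uniform(0, 1, n)
        # heavily skewed, mean-zero errors (recentered lognormal)
        eps = rng.lognormal(0.0, 1.5, n) - np.exp(1.5**2 / 2)
        y = 1.0 + beta * x + eps
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        lo, hi = fit.conf_int()[1]   # default 95% CI for the slope
        covered += (lo <= beta <= hi)

    print("empirical coverage of the nominal 95% CI:", covered / reps)

If the empirical coverage comes out well below 0.95, that's the finite-sample breakdown of the CLT-motivated standard errors described above.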


there's a salty purist like you listing buzzwords from casella and berger in every thread. your brethren in the data structures threads are those repping CLRS or taocp. and yet data science as an industry marches on (not to say anything of dev).

>prove, and improve on existing techniques

how many "data scientists" in industry are proving or improving existing techniques in their day to day?

>Like how a mechanic doesn't need to know any physics

this is absolutely correct since, you know, they're able to provide value for customers by fixing their cars

i would say that this is exactly what elitism looks like: you take a look at someone that's successful but under different terms than you yourself see fit, and claim it's not genuine success.


>there's a salty purist like you listing buzzwords from casella and berger in every threa

That cracked me up lol. I was hired as a data scientist at big tech with only some high-level data science skills. I first learned to write production-quality code. I then started (and am still) working through Casella & Berger; I'm 1/3 through. It's hard! I've had to stop and brush up on math a lot. I love learning theoretical stats, and I feel it makes me a better data scientist. But I also know I've worked with PhD data scientists who worked through all of C&B but couldn't code, and they weren't as valuable as the more self-taught, code-first data scientists (at least on my team).


I had never heard of Casella & Berger (but I'm the salty guy on the data structures threads going on about how great TAOCP is). I looked it up on Amazon and it looks interesting and like exactly what I've been looking for, but there's no "look inside this book" or even a table of contents on there. Before I pick it up, how much math background would I probably need to be able to penetrate it? I'm up on calculus and linear algebra, but I only took one statistics course as an undergrad 25 years ago.


there's a pdf floating around - for me it's the first hit when i search casella berger on google

>I'm up on calculus and linear algebra, but I only took one statistics course as an undergrad 25 years ago.

it is a serious graduate math stats book but i personally don't consider it a graduate math book (because it does not use measure theory). it is proof focused because it's used for foundational grad stats classes (i.e. prep for quals) but the proofs aren't that hard to follow (if you've had a good calculus class [epsilon-delta definition of limits]).

some of the exercises are brutal algebraically though. good thing there's a solutions manual floating around :)


> some of the exercises are brutal algebraically though

This is definitely true, and I wouldn't recommend this book to anyone unless they considered this a feature. For me, it was/is a feature, because I suck at algebra, and this is helpful.


There's a huge difference. If I ask "why is this model applicable to this situation?" or even "is this model applicable to this situation?" then I want the practitioner to have the ability to provide and understand the answer. Yes, data science as an industry marches on, but that's because wrong answers are largely undetectable.

If I roll out the treatment cell of an experiment but shouldn't have, because I didn't account for within-client correlation, the world marches on and the business thinks they are making data driven decisions. They'll never know they left millions on the table. Or maybe I'm right for the wrong reasons and I was merely more confident than I should have been and the world still marches on oblivious because there are very very few correction mechanisms.
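
As a toy illustration of the within-client correlation point (entirely made-up data, using statsmodels' clustered covariance option; not the commenter's actual experiment): assign treatment at the client level, let observations share a client-level shock, and compare the naive standard error on the treatment effect with the cluster-robust one.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n_clients, obs_per_client = 50, 40
    client = np.repeat(np.arange(n_clients), obs_per_client)
    # treatment assigned per client, not per observation
    treated = np.repeat(rng.integers(0, 2, n_clients), obs_per_client).astype(float)
    # shared client-level shock induces within-client correlation
    client_shock = np.repeat(rng.normal(0, 1, n_clients), obs_per_client)
    y = 0.1 * treated + client_shock + rng.normal(0, 1, len(client))

    X = sm.add_constant(treated)
    naive = sm.OLS(y, X).fit()   # pretends all observations are independent
    clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": client})

    print("naive SE on the treatment effect:    ", naive.bse[1])
    print("clustered SE on the treatment effect:", clustered.bse[1])

The naive SE will typically be several times too small, which is exactly how you end up more confident than you should have been.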


The academic math types with no ML experience don't fare much better than the no-math practitioners with experience from what I've seen. They generally have terrible mental models of how the math behaves when applied to real data.

IME the competent data scientist needs to have a high-level understanding of the concepts underlying ML (stats, optimization, central limit theorem) and to be comfortable reading math to understand the low-level details when they are relevant. In addition to actually understanding the ML techniques they're using obviously.

On top of that, I would rather work with someone who can code well than a pure mathematician with a ton of experience proving theorems (and no coding ability). Outside of a few industry research jobs, the cost:benefit ratio of making substantial theoretical breakthroughs in most everyday ML work doesn't justify a lot of time doing theoretical proofs.


The person you're replying to isn't talking about "theoretical breakthroughs", he's talking about a basic ability to interpret and detect errors in your model.

These are abilities that "data scientists" do, indeed, lack.


My impression from working closely with data scientists is that what you describe comes more from experience than from purely theoretical knowledge of the underlying math. Ideally it's both, but if I had to choose, I'd rather bet on someone who has already done it a dozen times and knows what to expect.


This whole "how much math is required to work in ______" always feels like reading Rashomon. Everyone is looking at the same situation, but no one sees it the same way. Data Science teams range in size and sophistication across various industries, so obviously "what it takes to be a data scientist" varies based on the company.

Want to be a data scientist in a large org. with limited data science sophistication? You may not need an advanced mathematics background to make an impact; out-of-the-box models/techniques might produce actionable results if you have access to clean data.

Try doing the same thing at sophisticated tech companies (FANG, BAT, etc.). The datasets are massive, heterogeneous, and often unstructured. Add to that, large teams full of former Ph.D.s and clever engineers have already given the common problems a go, so any better solution will need to be 'more clever'. It's difficult to be 'more clever' without advanced math.

What is hilarious about these articles is none of them ever say "Math as a SENIOR data scientist", because we all acknowledge that senior scientists know the math. Almost every job listing for senior data scientist will prefer a Ph.D. On larger teams, the Ph.D (advanced math) crowd will come up with the general approach that junior data scientists will work on. So yes, getting in the door as a data scientist requires having enough math/programming so you can at least understand what a senior scientist wants you to do. But if you want to lead a team in a technical capacity, you can't possibly imagine doing that without really knowing the math or having a brilliant track record.

And if you are the rockstar 10x guru Data Scientist who consistently delivers actionable insights without a strong math background, you deserve the respect of everyone here and will probably be my boss someday.


> What is hilarious about these articles is none of them ever say "Math as a SENIOR data scientist", because we all acknowledge that senior scientists know the math. Almost every job listing for senior data scientist will prefer a Ph.D. On larger teams, the Ph.D (advanced math) crowd will come up with the general approach that junior data scientists will work on. So yes, getting in the door as a data scientist requires having enough math/programming so you can at least understand what a senior scientist wants you to do. But if you want to lead a team in a technical capacity, you can't possibly imagine doing that without really knowing the math or having a brilliant track record.

In other words, being a Data Scientist requires performing "scientific" activities. Most of these articles focus on the "Data" part and very rarely on the "science". Science requires excellent scholarship: synthesizing the current situation and putting future proposals in the context of the current "jigsaw puzzle". That is a skill you cannot learn by reading; you need to practice it for some period of time. No wonder a Ph.D. is a requirement for such positions.


> If you’ve completed a math degree or some other degree that provides an emphasis on quantitative skills, you’re probably wondering if everything you learned to get your degree was necessary.

It should be emphasized that obtaining a degree in a subject like math or physics should "form" your way of thinking in a certain way: that you are able to apply its methods to other fields, such as data science. That's way more important than remembering "everything you learned". You might never do a Fourier transform by hand again, but you know the concept. And you can apply it.


I've met plenty of bootcamp graduates who "know the math" but still can't move from rote recitation to intuitive relation. You can ask for the form of the loss function for "classification" (i.e., the usual cross-entropy) and get it every time. But it is apparent that without a lot of time (which becomes significantly shorter with prior physics/compsci/STEM experience) they can't look at the various tools as building blocks and assemble them into novel machine learning systems. They only understand inductive bias as it pertains to answering trivia about the differences between CNNs and RNNs. Disclaimer: this is an anecdote, but one that is echoed across a number of my colleagues, most of whom I trust not to fall prey to the usual cognitive biases.


Hmm decent math proficiency is necessary to understand data science, but I don’t think it’s much of a differentiator.

Communication and critical thinking, not unlike many jobs, are the more important things. IMO many data scientists suck at the former.


Being able to do your job is always the number 1 requirement of doing your job. Being able to communicate and think critically are nice-to-haves. A data scientist who doesn't know data science is useless.


I think this article was written for SEO.

Calculus, linear algebra, and probability would cover pretty much all of the algorithms mentioned in the article.


That is to data science what McDonald's is to food.


Your comment was downvoted presumably because it is flip, but I think there's a kernel of truth here. The blog post correctly highlights that limited math is needed to "do" data science, but this is why we have so many irresponsible data scientists.

I'm currently teaching a data science class intended for mid-career professionals. It requires no college-level math. And with virtually every topic, I feel like I'm doing them an injustice by not being able to delve into the mathematics behind what we're doing. They'll be able to put on their CVs that they "do data science", but I don't trust that my instruction gets them to the level where they really understand much about it.



