
Statistical Techniques Data Scientists Need to Master - lobo_tuerto
https://towardsdatascience.com/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7
======
cwyers
Somewhere between 40 and 60 percent of data science is dealing with stuff like
"some of the people collecting the data used 1 for yes and 0 for no, and some
of them used 'Y' for yes and 'N' for no, and some misread the question and put
in numbers, your job is to figure out which 1s mean yes and which mean that
the data is bad," and you can go your entire career in data science without
ever using a SVM. But sure, let's have yet another article on methods.
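The kind of cleanup described above can be sketched in a few lines. This is a minimal, hypothetical example (the function name, the mapping, and the sample values are mine, not the commenter's):

```python
# Sketch of the cleanup problem described above: several collectors
# encoded yes/no answers differently, and some values are just bad.
# The mapping and sample data are illustrative, not from the thread.

def normalize_yes_no(value):
    """Map the known encodings to True/False; flag everything else."""
    mapping = {
        1: True, 0: False,        # collectors who used 1/0
        "Y": True, "N": False,    # collectors who used Y/N
        "y": True, "n": False,
    }
    if isinstance(value, str):
        value = value.strip()
    if value in mapping:
        return mapping[value]
    return None  # unrecognized: likely a misread question, needs manual review

raw = [1, "Y", 0, "n", 7, "maybe"]
cleaned = [normalize_yes_no(v) for v in raw]
print(cleaned)  # [True, True, False, False, None, None]
```

The hard part in practice, of course, is discovering what the encodings were in the first place; the code is the easy 10%.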

~~~
existencebox
When I opened this thread, this was downvoted. I can understand why, it's
kinda dismissive of the article. However, I can also understand the
frustration and would generally agree.

Context: Was a data scientist for a bigCo for a few years. 90% of my day to
day was cleaning, organizing, building pipelines for, building golden sets
for, messy data. I was in a _very fortunate_ position where I got to actually
develop some models from the ground up/do some "Creative Work" as it were, but
the vast majority of my workload was not stuff I had ever really been prepared
for despite a degree in stats.

I don't say this to paint one or the other as "more important/better/more
respectable" work, but compared to how much I read about stats vs. going elbow
deep in real-world data munging, the written emphasis on the former is
certainly out of proportion to my actual use of each. My guess? More generalizable, well
encapsulated, easy to talk about. I'd be hard pressed to discuss "general
techniques of figuring out how to un-fuck whatever data was being collected in
that legacy telemetry system that really wasn't meant for what it's being used
for", but that sort of work for me at least proved to be the bulk of my
learning on the job.

~~~
fwdpropaganda
> When I opened this thread, this was downvoted. I can understand why, it's
> kinda dismissive of the article. However, I can also understand the
> frustration and would generally agree.

Although I didn't downvote, I often downvote comments that are negative on
knowledge. You know the type. If we're talking about CS they will argue that
you don't have to know algorithms and data structures to be a good developer,
because you can just go on stackexchange. If we're talking about data, they
will argue that knowing what a p-value is is useless because the methodology is
flawed anyway. If we're talking about knowledge itself they will argue that
going to university is a waste of time. And of course, no matter the subject
matter, you never need "all that maths".

~~~
mlthoughts2018
I don't usually see many comments like this, which "downvote knowledge" as you
say.

Instead, I see comments arguing that the _emphasis_ placed on that knowledge
is wrong-headed or misplaced.

For example, knowledge about data structures is a great thing. But defining an
interview process with data structure trivia as an implicit standard for
hiring someone is bad, and is a good teachable moment for pointing out why
it's a silly way to filter candidates.

Similarly, detailed knowledge about advanced statistical methods is a great
thing. But it's silly to force candidates to fill resumes with buzzwords or
treat their stats knowledge like peacock feathers because those are the hoops
to get hired, when in reality the work will be 90% data engineering and
devops. It's similarly a teachable moment to point out why an emphasis on
these methods is silly, unless you're in a business that actually plans to
give an employee projects and tasks that require these things.

I guess if I saw a comment that truly was dismissive of the knowledge, in and
of itself, then I would agree that is worth downvoting.

I honestly can say I've not seen that very often, though. Only comments
reacting to the way signalling effects are used to subvert the knowledge,
e.g. for silly hiring trivia or data science credentialism.

~~~
madhadron
> Similarly, detailed knowledge about advanced statistical methods is a great
> thing.

We have to keep in mind, though, that nothing in the article is advanced
statistical methods. They're the very, very beginning. Being able to run a
simple linear regression or bootstrap a data set is equivalent to being able
to write fizzbuzz for a programmer.
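The two "fizzbuzz-level" tasks named above fit in a few lines each. A minimal sketch on synthetic data (the data and seed are mine, purely illustrative):

```python
import numpy as np

# Fit a simple linear regression by the textbook formula, then
# bootstrap the slope. Data is synthetic: y = 2x + 1 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)

def ols_slope(x, y):
    # slope = cov(x, y) / var(x)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

slope = ols_slope(x, y)

# Bootstrap: resample (x, y) pairs with replacement, refit each time.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))
    boot.append(ols_slope(x[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"slope ~ {slope:.3f}, 95% bootstrap CI ~ ({lo:.3f}, {hi:.3f})")
```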

~~~
mlthoughts2018
I’d be more inclined to say that understanding expected value and variance at
a baseline level is like fizzbuzz of statistics.

Basic regression at least requires understanding sampling standard error of
estimators, distinguishing p-values from the posterior probability a
hypothesis is true, transformation of random variables (like for understanding
why classical 95% confidence interval bounds are ±1.96 * standard error)
and some linear algebra.
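For concreteness, the ±1.96 * standard error interval mentioned above looks like this for a sample mean. A stdlib-only sketch on simulated data (the numbers are mine):

```python
import math
import random

# 95% CI for a mean: for a large sample, the sample mean is
# approximately normal, so the interval is mean ± 1.96 * SE,
# where SE = sample std / sqrt(n). Data here is simulated.
random.seed(42)
data = [random.gauss(5.0, 2.0) for _ in range(400)]

n = len(data)
mean = sum(data) / n
var = sum((v - mean) ** 2 for v in data) / (n - 1)
se = math.sqrt(var / n)          # standard error of the mean

z = 1.96                         # 97.5th percentile of the standard normal
ci = (mean - z * se, mean + z * se)
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The 1.96 is exactly the "transformation of random variables" point: it's the quantile of the standard normal that the standardized estimator is compared against.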

But I agree the methods discussed, with the exception maybe of support vector
machines, are all elementary statistical methods.

What’s sad is that if these represent the fizzbuzz of statistics, it means
most jobs in statistics only require you to do tasks easier than fizzbuzz
(cursory data cleaning & summary stats). Because in most jobs, dimensionality
reduction, SVMs, clustering and boosting would be seen as incredibly advanced,
scarce projects that people fight over the chance to work on.

~~~
fwdpropaganda
I object to the comparison entirely. Basic statistical concepts are orders of
magnitude more complex than a loop with a couple of ifs.

As a software engineer you CAN'T fail fizzbuzz. Whereas in statistics I can
come up with problems that rely only on basics but are still counter-intuitive
enough that some practitioners will make mistakes.
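One classic example of a basics-only problem that trips up practitioners (my example, not the commenter's) is the base-rate fallacy:

```python
# A test with 95% sensitivity and 95% specificity for a condition
# with 1% prevalence. Many practitioners guess that a positive
# result implies ~95% chance of having the condition.

prevalence = 0.01
sensitivity = 0.95   # P(positive | condition)
specificity = 0.95   # P(negative | no condition)

# Total probability of testing positive (true + false positives).
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' rule: P(condition | positive).
posterior = sensitivity * prevalence / p_pos
print(f"P(condition | positive) = {posterior:.3f}")  # = 0.161
```

Nothing here is beyond an intro course, yet the answer (about 16%, not 95%) routinely surprises people.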

~~~
madhadron
I used to work as a statistician. Now I work as a software engineer. The
comparison is valid between what is considered the basics to operate as a
professional.

------
curiousgal
Linear Regression

Classification

Resampling Methods

Subset Selection

Shrinkage

Dimension Reduction

Nonlinear Models

Tree-based methods

Support Vector Machines

Unsupervised learning

The main take away is that Machine Learning _is_ Statistics.

~~~
mr_toad
> The main take away is that Machine Learning is Statistics.

That and brute force computation. A lot of ML models use vastly more data and
bigger feature sets than was feasible when these techniques were originally
developed. Stepwise regression, lassos, cross validation, etc have become much
more popular as raw computing power has increased.
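The compute multiplication is easy to see in code. A minimal sketch of k-fold cross-validation to pick a penalty (using ridge rather than lasso, since ridge has a closed-form fit that keeps the example short; data and penalty grid are mine):

```python
import numpy as np

# k-fold CV refits the model k times per candidate setting, so the
# cost is (k * number of candidates) full fits -- cheap today, not
# when these techniques were first developed. Synthetic data.
rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]        # only 3 real signals among 20 features
y = X @ beta + rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge: (X'X + lam*I)^-1 X'y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def cv_error(X, y, lam, k=5):
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return float(np.mean(errs))

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(f"CV-selected penalty: {best}")
```

Five folds times five candidates is 25 full model fits for one small problem; scale the grid and the data and the "brute force" point makes itself.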

~~~
bllguo
thanks for this comment. I was reading the article thinking "do people really
use stepwise / all-subsets regression these days?", but your point is fair.

------
rawland
This blog post's topics are remarkably similar to Stanford's Introduction to
Statistical Learning's topic order.

<snip>

    Topics include
    * Overview of statistical learning
    * Linear regression
    * Classification
    * Resampling methods
    * Linear model selection and regularization
    * Moving beyond linearity
    * Tree-based methods
    * Support vector machines
    * Unsupervised learning

</snip>

source:
[https://online.stanford.edu/courses/stats216v-introduction-s...](https://online.stanford.edu/courses/stats216v-introduction-statistical-learning)

There is a free (R-based) course by Trevor Hastie which covers this material:
[https://online.stanford.edu/courses/sohs-ystatslearning-stat...](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning-self-paced) - it's mentioned in the blog post.

