
Statistics for Hackers - tomaskazemekas
https://speakerdeck.com/jakevdp/statistics-for-hackers
======
pavpanchekha
This talk was great, especially its focus on non-parametric techniques. A lot
of people are misled by the traditional approach to statistics, where it looks
like stats is about memorizing a relatively short list of (really scary)
formulas that let you compute one thing or another. People do not think about
where these formulas come from, or the simple ideas behind them.

On the other hand, I think there is not enough written about how to do
statistics with real, large, complex data sets. I tried to write something
like this
([https://pavpanchekha.com/blog/stats1.html](https://pavpanchekha.com/blog/stats1.html))
but of course the difficulty is that investigating a complex dataset is by
definition too complex to really fit into a blog post. In a large dataset, the
hard part is how and when you decide _what_ to analyze.

~~~
rsy96
Every math textbook I have ever read starts with the intuition behind the
idea, and then presents definitions and proofs. Only some reference books,
intended for lookups, merely enumerate formulas. I have never understood why
anyone claims that they only learned math by memorizing formulas.

~~~
wodenokoto
Because they learn math in junior high where they really don't care about the
subject, so it gets boiled down to "remember this to pass the exam",
regardless of what the book says.

------
stdbrouw
For people who are interested in learning more about simulation, permutation
tests and bootstrapping, the three techniques that the slide deck discusses,
check out Allen Downey's Think Stats:
[http://greenteapress.com/thinkstats2/](http://greenteapress.com/thinkstats2/).

It really is a pity that introductions to statistics spend so much time on
analytic approximations and so little on the underlying concepts.

------
analognoise
Can we please stop putting "hack" on EVERYTHING?

It isn't a hack to present mathematics in an understandable way - it's a
pedagogical improvement for introductory works, not a "hack".

~~~
anewhnaccount
The hacking here refers to the approach of "just compute it" rather than using
an analytical approach.
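
As a rough illustration of "just compute it", here is a minimal sketch that
estimates by simulation how often a fair coin gives at least a certain number
of heads, instead of working out the binomial tail by hand. The counts (22
heads in 30 tosses), the function name and the seed are assumptions chosen for
illustration.

```python
import random

def simulated_p_value(n_tosses, observed_heads, n_trials=20_000, seed=0):
    """Estimate P(heads >= observed_heads) for a fair coin by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Toss a fair coin n_tosses times and count the heads.
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if heads >= observed_heads:
            hits += 1
    return hits / n_trials

# Hypothetical experiment: 22 heads out of 30 tosses.
p = simulated_p_value(30, 22)
print(p)
```

The exact binomial tail for this case is about 0.008, so the simulated value
should land near that.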

~~~
analognoise
Then it's even worse than "not hacking" - it's not useful.

Any idiot can run a simulation, or compute "something". "Something" is only
useful in context.

"If you can write a for loop, you can do statistics!" Ugh. Can nobody read
anymore? It has to be a "slide deck"?

Jesus, pick up a book and ask some goddamn questions if you're interested.

~~~
rankam
Why are you so angry? He gave a talk and released the slides because he
thought people may benefit from it - given Jake's (the speaker) background and
skills, I would say he's doing everyone a favor by publicly releasing this.

Calm down.

------
int3
What books would y'all recommend for someone who has taken a couple of
college-level proof courses but never took anything on probability and
statistics?

~~~
chestervonwinch
I've said this elsewhere, but I recommend Casella & Berger's Statistical
Inference. It will take you from probability theory to statistics... all the
basics. I found it to be a very readable text, and it is used for many first
graduate courses in stats - for me it was used for a 2 semester sequence first
focusing on probability, then on statistics.

~~~
stdbrouw
I like Casella and Berger, especially because it does so well at showing the
connection between probability and statistics (first five chapters or so), but
it should be noted that its approach is very different from the one in the
slide deck above. Casella/Berger is purely frequentist rather than
computational or Bayesian, and it spends a lot of time on likelihood theory
and treats e.g. bootstrapping only very summarily.

------
jrbapna
Anybody have a link to the actual talk?

~~~
TwoFx
There is none.
[https://twitter.com/jakevdp/status/644863206339379201](https://twitter.com/jakevdp/status/644863206339379201)

------
spenczar5
Bootstrapping deserves to be much better known. The fact that it works is
miraculous.

~~~
solomatov
It doesn't always work. Its range of applicability is limited, though wide
enough.

~~~
stdbrouw
"It doesn't always work" really is not a very useful thing to say unless you
elaborate on when exactly it doesn't work and why.

Bootstrapping doesn't work for very low n (e.g. n=10) because the resamples
are not smooth enough and it doesn't always work that well for estimating
quantiles. But analytic methods fare pretty poorly in these circumstances too
and in any case people are mostly interested in confidence intervals around
the mean anyway.
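
For reference, a confidence interval around the mean can be bootstrapped in a
few lines. This is a minimal sketch of the percentile bootstrap, assuming
independent observations; the function name, the sample (50 draws from an
exponential distribution) and the seeds are made up for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical sample: 50 draws from a skewed distribution with true mean 1.
rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(50)]
lo, hi = bootstrap_mean_ci(sample)
print(lo, statistics.fmean(sample), hi)
```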

~~~
solomatov
You are assuming that the distributions we work with are normal, and that the
sampled values are independent. If you take a fat-tailed distribution (for
example, Cauchy), the bootstrap wouldn't work. If the values are dependent, it
wouldn't work either.

See also here: [http://stats.stackexchange.com/questions/172920/bootstrap-
me...](http://stats.stackexchange.com/questions/172920/bootstrap-method-
downsides) They give an example of a non-pathological distribution for which
the bootstrap doesn't provide a good estimate: uniform(0, theta).
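
One way to see the uniform(0, theta) problem: if theta is estimated by the
sample maximum, a large share of bootstrap resamples reproduce exactly that
maximum, and none can exceed it. The sketch below counts that share; the
function name, sample size and seed are assumptions for illustration.

```python
import random

def bootstrap_max_ties(n=100, n_resamples=5_000, theta=1.0, seed=0):
    """Fraction of bootstrap resamples whose maximum equals the sample maximum."""
    rng = random.Random(seed)
    sample = [rng.uniform(0, theta) for _ in range(n)]
    sample_max = max(sample)
    ties = sum(
        max(rng.choices(sample, k=n)) == sample_max for _ in range(n_resamples)
    )
    return ties / n_resamples

frac = bootstrap_max_ties()
print(frac)  # close to 1 - (1 - 1/n)**n, about 0.63 for n = 100
```

The true sampling distribution of the maximum is continuous, so a bootstrap
distribution with roughly 63% of its mass on a single point is a poor stand-in
for it.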

~~~
stdbrouw
That's nonsense. The entire point of the bootstrap is that it does not require
either the original or the sampling distribution to be normal. (For that
matter, due to the Central Limit Theorem, neither do analytic approximations
like a t-test for comparing different group means.)

You misunderstand the point about the Cauchy distribution in the answer on
Cross Validated. The Cauchy distribution is a degenerate case, mostly
interesting as an academic toy because it has infinite variance. Of course
that's not going to fare well.

Dependent data can be tricky to deal with, but you can bootstrap such data by
removing the dependence, bootstrapping the independent data, and adding the
dependence back in. This sounds hard but is usually as easy as running a
regression and subtracting/adding the component (x*beta) that leads to the
dependency. Alternatively, for time series there are window methods.
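
One reading of the regression recipe above is the residual bootstrap: fit the
line, resample the residuals, and add them back onto the fitted values. The
sketch below does this for a simple one-variable regression. It assumes the
residuals are exchangeable once the x*beta part is removed; the data, function
names and seeds are hypothetical.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def residual_bootstrap_slope(x, y, n_resamples=2_000, seed=0):
    """95% percentile interval for the slope, resampling residuals."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    # Remove the component explained by x.
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_resamples):
        # Bootstrap what is left, then add the explained component back in.
        resampled = rng.choices(residuals, k=len(x))
        y_star = [fi + ri for fi, ri in zip(fitted, resampled)]
        slopes.append(fit_line(x, y_star)[1])
    slopes.sort()
    return slopes[int(0.025 * n_resamples)], slopes[int(0.975 * n_resamples) - 1]

# Hypothetical data: y = 1 + 2x plus Gaussian noise.
rng = random.Random(1)
x = [i / 10 for i in range(50)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
lo, hi = residual_bootstrap_slope(x, y)
print(lo, hi)
```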

Of course a short slide deck like the one linked to in this thread is not
going to teach you all the finer points of bootstrapping and you can
definitely do it wrong. But compared to all the assumptions that frequentist
statistics makes to generate confidence intervals and the fact that you need a
different method for every different scenario, bootstrapping is about as
robust and idiot-proof as it's going to get. "It has its applicability" is
beyond selling it short.

~~~
solomatov
> For that matter, due to the Central Limit Theorem, neither do analytic
> approximations like a t-test for comparing different group means.

Actually, it doesn't. You need a confidence interval, and the central limit
theorem doesn't give you a confidence interval. It just says that the
distribution is close enough to normal at some point. In some cases, it might
take a very large n before it becomes close to normal.

~~~
stdbrouw
And exactly how long it will take for the approximation to reach a certain
level of accuracy can be ascertained by using the other technique mentioned in
the slide deck: run a simulation. And so we have come full circle :-)
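
A minimal sketch of such a simulation: draw many samples from a skewed
distribution, build a nominal 95% interval from the normal approximation each
time, and count how often it covers the true mean. The choice of an
exponential distribution, the sample sizes, the critical value 1.96 (rather
than a t quantile) and the seed are all assumptions for illustration.

```python
import random
import statistics

def coverage(n, n_trials=2_000, seed=0):
    """Share of nominal 95% normal-approximation intervals covering the true mean."""
    rng = random.Random(seed)
    true_mean = 1.0   # mean of an exponential distribution with rate 1
    z = 1.96          # normal critical value for a 95% interval
    hits = 0
    for _ in range(n_trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if m - z * se <= true_mean <= m + z * se:
            hits += 1
    return hits / n_trials

results = {n: coverage(n) for n in (5, 30, 200)}
for n, c in results.items():
    print(n, c)
```

Coverage well below 0.95 at small n, rising towards it as n grows, shows how
quickly the approximation becomes usable for this particular distribution.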

------
minimaxir
The talk is very explicitly a tutorial on a few good statistical methods for
those without a statistical background, using simple tools that "hackers" are
familiar with. These methods are occasionally necessary, especially
bootstrapping. This isn't hacking in the sense of "growth hacking,"
thankfully.

------
hartator
Coin toss is a bad example to take as coins can't be loaded. However, very
interesting.

~~~
texthompson
Coins can be biased; they just need to be bent so far out of shape that it's
obvious. Here's a fun example:

[https://izbicki.me/blog/how-to-create-an-unfair-coin-and-
pro...](https://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-
with-math.html)

------
porter
This needs to be a course. I'd pay for it.

------
ziles88
The term hacker is just so overused at this point. Does it just mean someone
who is good at tech in the Bay?

~~~
code_sterling
How so? To me, a hacker is anyone who refuses to accept things as they are,
has an innate curiosity and a drive to get to the root of a problem, and
finally changes the rules to fit their prerogative.

George Bernard Shaw would refer to this individual as an unreasonable man, but
I prefer hacker.

