
The Little Handbook of Statistical Practice (2012) - Anon84
http://www.jerrydallal.com/LHSP/LHSP.htm
======
mikorym
I've often told people in passing that "if you have 20 parameters and
p=0.05, you should expect random data to have something register as
significant". Looks like OP has beaten me to illustrating this. [1]

[1]
[http://www.jerrydallal.com/LHSP/coffee.htm](http://www.jerrydallal.com/LHSP/coffee.htm)
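
A quick way to see it for yourself - a minimal simulation sketch in Python
(mine, not from the article; the numbers are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_runs, n_tests, n = 1000, 20, 30
    hits = 0
    for _ in range(n_runs):
        # two groups drawn from the SAME distribution, measured on 20
        # unrelated outcomes, each tested at p < 0.05
        a = rng.normal(size=(n_tests, n))
        b = rng.normal(size=(n_tests, n))
        p = stats.ttest_ind(a, b, axis=1).pvalue
        hits += (p < 0.05).any()
    print(hits / n_runs)  # roughly 1 - 0.95**20, i.e. about 0.64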

~~~
Nokinside
If you have 20 parameters, you should probably compare the effect sizes
first. If no effect size stands out, p=0.05 or even p=0.001 can't make the
result important.

[https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913)

> no p-value can reveal the plausibility, presence, truth, or importance of an
> association or effect. Therefore, a label of statistical significance does
> not mean or imply that an association or effect is highly probable, real,
> true, or important. Nor does a label of statistical nonsignificance lead to
> the association or effect being improbable, absent, false, or unimportant.
> Yet the dichotomization into “significant” and “not significant” is taken as
> an imprimatur of authority on these characteristics. In a world without
> bright lines, on the other hand, it becomes untenable to assert dramatic
> differences in interpretation from inconsequential differences in estimates.
> As Gelman and Stern (2006, “The Difference Between ‘Significant’ and ‘Not
> Significant’ Is Not Itself Statistically Significant,” The American
> Statistician, 60, 328–331, DOI: 10.1198/000313006X152649) famously
> observed, the difference between “significant” and “not significant” is
> not itself statistically significant.
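
To make the point concrete, here's a small sketch (mine, not from the paper)
where a trivially small effect comes out "significant" simply because the
sample is huge:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 200_000  # per group
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.02, 1, n)  # true effect: 2% of a standard deviation
    t, p = stats.ttest_ind(a, b)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    print(f"p = {p:.1e}, Cohen's d = {d:.3f}")  # p is tiny, d is negligible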

~~~
brookhaven_dude
What do you say about neural networks with thousands of parameters?

~~~
atlasair
Neural networks are just used for prediction, not inference.

------
sevensor
This comes at a good time! I was just talking with my spouse about why you put
n-1 in the denominator when you calculate variance. I always liked "because
you used up a degree of freedom calculating the mean," but I feel like that's
kind of hand-wavy. (Obviously, I'm not a statistician.)

~~~
clircle
You can argue either way. Dividing by n gives the maximum likelihood
estimator (under a normal model), but dividing by n-1 gives the unbiased
estimator. It depends on what you like. Most people prefer unbiased
estimators when they are available.
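
If you'd rather see the bias than take it on faith, a quick simulation
(a sketch, not from the handbook) does it:

    import numpy as np

    rng = np.random.default_rng(2)
    # a million samples of size 10 from a population with variance 4
    samples = rng.normal(0, 2, size=(1_000_000, 10))
    print(samples.var(axis=1, ddof=0).mean())  # divide by n:   ~3.6, low by (n-1)/n
    print(samples.var(axis=1, ddof=1).mean())  # divide by n-1: ~4.0, unbiased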

~~~
sanderjd
What does it mean for it to be "unbiased"? What does it mean to "use up" a
degree of freedom?

I don't mind if things just can't really be explained intuitively because they
are fundamentally technical, but your explanation and the parent's both do
this thing where it sounds like it's explaining things in plain common
language, but isn't actually because it isn't clear what those plain words
mean in this context.

~~~
absherwin
Unbiased means that if I draw infinitely many random samples from a
population and average a statistic (in this case the variance) across all
the samples, the average will equal the statistic computed on the population
itself. If one divides by n instead of n-1, the estimate of the variance
will be too small by a factor of (n-1)/n. A reader might object, "Wait!
We're going to infinity, so the ratio converges to 1." That's true if the
size of each sample also goes to infinity, but not if we draw millions of
ten-item samples.

As for using up a degree of freedom, the easiest way to build intuition for
why this is a useful concept is to think about very small samples. Say I
draw a sample of one item. By definition that item equals the sample mean,
so I receive no information about the variance. Conversely, if someone had
told me the population mean in advance, I could learn a bit about the
variance from a single observation. This carries on beyond one item, in
diminishing amounts. Imagine I draw two items. There's some probability that
they're both on the same side of the true mean; in that case, my sample mean
lands between those numbers and I underestimate the variance. Note that I'd
still underestimate it in that case even with the bias correction; it's just
that the correction compensates enough to balance out over all cases.

A simple, concrete way to convince yourself that this is real is to consider
a variable that has an equal probability of being 1 or 0. Its variance is
0.25. If we randomly sample two items, 50% of the time they'll be the same
and we'll estimate the variance as zero. The other 50% of the time, dividing
by n gives exactly the right answer: deviations of ±0.5 from the sample
mean, so (0.25 + 0.25)/2 = 0.25. Hence, our average estimate is half the
right answer. Dividing by n-1 = 1 instead doubles the estimate to 0.5 in the
cases where the items differ, while the zeros stay zero, and the average
works out to the true 0.25. This also suggests why dividing by n is referred
to as the maximum likelihood estimator.
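
That two-item example is small enough to enumerate exactly. A sketch in
Python (all four equally likely samples of a fair 0/1 variable):

    import numpy as np
    from itertools import product

    for ddof, label in [(0, "divide by n  "), (1, "divide by n-1")]:
        estimates = [np.var(s, ddof=ddof) for s in product([0, 1], repeat=2)]
        print(label, np.mean(estimates))
    # divide by n   -> 0.125, half the true variance of 0.25
    # divide by n-1 -> 0.25, unbiased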

~~~
krychu
This is very helpful, especially the example at the end, thanks. I think the
difficult part to understand is that dividing by n leads to an estimate that
is somehow too small. Intuition says that dividing by n should just give you
the true average.

------
jfim
Just as a warning: while the book covers multiple comparisons, it's possible
to read it in a way that skips that material.

The section called "A Valuable Lesson" does show that running multiple tests
against the same threshold of P<0.05 causes nonexistent effects to be
reported as statistically significant, but the material on correcting for
this appears much later, in the section on ANOVA.

That's actually a pretty severe flaw, especially for a handbook that is
likely to be read piecemeal.
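
For anyone who does skip ahead: the simplest correction is Bonferroni -
divide the threshold by the number of tests. (A sketch with made-up
p-values; I'm not claiming this is the handbook's exact prescription.)

    import numpy as np

    pvalues = np.array([0.001, 0.020, 0.049, 0.300])
    alpha = 0.05
    m = len(pvalues)
    print(pvalues < alpha)      # naive: three "significant" results
    print(pvalues < alpha / m)  # Bonferroni: only the strongest survives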

------
talson
I just finished a class that covered this material. Is there a good next
step or book for someone interested in data science?

~~~
subroutine
What software / lang are you currently using to analyze data?

~~~
talson
It used Excel and Python. I’ve also personally used R outside of the class.

~~~
subroutine
It's good that you got exposure to Python (and, I assume, numpy/scipy/pandas
etc.), and that you're already familiar with R. Are you majoring in data
science and just looking for something extra?

If you're interested in machine learning, Andrew Ng's Coursera course is
almost a rite of passage at this point - it's very accessible:
[https://www.coursera.org/learn/machine-learning](https://www.coursera.org/learn/machine-learning)

Kutner is the bible on regression models (tho, not a super fun read):
[https://www.amazon.com/dp/0073014664](https://www.amazon.com/dp/0073014664)

This was one of my favorites as an undergrad:
[https://www.amazon.com/gp/product/0805833889](https://www.amazon.com/gp/product/0805833889)

~~~
talson
I’ll check these out, thanks for the recommendations. I’m not majoring in
data science, but I do find it really interesting and want to learn more.

------
chefschef
I love tiny, reference-able resources like this!

------
wazoox
The announcement at the top of the page[0] makes me sad. I don't want to use
a Kindle, nor files in that format. Why is there no ePub option?

[0]:[http://www.jerrydallal.com/LHSP/kdp.htm](http://www.jerrydallal.com/LHSP/kdp.htm)

------
jackallis
Beginner question: how is the information in here better than other
resources?

------
AdrienLemaire
Just sharing for context: written in 2001 and last modified in 2012. It's
great that some resources don't age.

~~~
vondur
Well, statistics hasn’t changed much, has it?

~~~
everybodyknows
At the level of application to scientific practice, yes, much has changed. For
a start:

[https://news.ycombinator.com/item?id=16859052](https://news.ycombinator.com/item?id=16859052)
[https://news.ycombinator.com/item?id=18577079](https://news.ycombinator.com/item?id=18577079)

~~~
tomrod
You have to understand the basics to understand the crisis.

