
Common statistical tests are linear models - homarp
https://lindeloev.github.io/tests-as-linear/
======
abhinai
_" Unfortunately, stats intro courses are usually taught as if each test is an
independent tool, needlessly making life more complicated for students and
teachers alike."_

Sadly enough this was true for so many other parts of my education.

~~~
fny
Shout out to Dr. Jerry Reiter at Duke who emphasized this in every stats class
I took with him!

------
joker3
This really ought to be better known. I think a large part of the reason that
it isn't is that most books that cover linear models in general require a
background in linear algebra, and there are very few people teaching from that
standpoint outside of the advanced undergraduate/beginning graduate level.

Another thing that I wish was more widely known is that a linear model is
linear in its parameters, not the data. You can apply arbitrary
transformations to the data and still have a linear model as long as what
you're fitting is of the form E[y] = \beta_0 + \beta_1 f_1(X) + \beta_2 f_2(X) +
... + \beta_p f_p(X).

~~~
bertomartin
The idea that linear models are linear in the parameters and not the data is a
bit confusing. I know the effect of this is that you fit curves with "linear"
models, but I don't feel like I fully understand this. Can you explain further
or link to some good resources?

~~~
dbieber
Each data point is a bunch of features x_1, x_2, ..., x_n. You can make new
features for your data points using whatever functions you like -- it doesn't
matter if they're linear. Let's say we add two new features x_{n+1} = f(x_1,
x_2) and x_{n+2} = g(x_2, x_3).

Now if we train a linear model on the new expanded set of features, it's
linear in those features. It's not linear in the original data though, because
of the new features that we introduced: x_{n+1} and x_{n+2}.
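
A minimal sketch of the point above, in plain Python with no libraries. The data, the choice of x^2 as the extra feature, and the helper names `solve` and `least_squares` are all made up for illustration. The fitted curve is a parabola in x, but the fit itself is ordinary least squares because the model is linear in the coefficients.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(features, y):
    """Return OLS coefficients via the normal equations (X'X) b = X'y."""
    p = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(features, y)) for i in range(p)]
    return solve(xtx, xty)


# Made-up data generated from y = 1 + 2x + 3x^2 (no noise, for clarity).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1 + 2 * x + 3 * x**2 for x in xs]

# Expanded features: a constant, x itself, and the nonlinear feature x^2.
features = [[1.0, x, x**2] for x in xs]
coefs = least_squares(features, ys)
print([round(c, 6) for c in coefs])
```

The printed coefficients should come out close to 1, 2 and 3. Swapping x^2 for log(x), sin(x) or any other fixed function of the data changes only the `features` line; the fitting code stays the same.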

------
viraptor
I'm really interested in this, but wish someone wrote a matching "how to learn
stats". This seems aimed at people who already have stats knowledge and
experience with R, so there's lots of "you know what I'm talking about"
assumptions.

~~~
cschmidt
I recommend The Cartoon Guide to Statistics, as a good way to get into stats.
I've taught a class using it at work a number of times, and it makes a great
first exposure. The equations are there, explained in a really nice way.

[https://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick...](https://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/dp/0062731025)

------
louden
When I was learning undergraduate mathematical statistics, my professor made
it a point to connect the links between linear models and many different
tests.

In the back of Statistical Inference by Casella and Berger, there is a great
chart making similar connections between the most common statistical
distributions.

There is a lot of rhyming in statistics.

~~~
clircle
A version of that great chart mentioned in the comment above can be found on
page 3 of this pdf:
[http://www.stat.rice.edu/~dobelman/courses/texts/leemis.dist...](http://www.stat.rice.edu/~dobelman/courses/texts/leemis.distributions.2008amstat.pdf)

~~~
laichzeit0
This is awesome. Thanks for sharing.

------
ineedasername
A very good intro. Machine learning etc. is fantastic and what can be done
with newer techniques blows me away. But I think the hype obscures the results
that can be obtained much more easily with more basic methods, as a precursor
to those advanced methods.

------
thanatropism
In econometrics training you learn to use the classical linear model and never
even learn the names for the parametric tests. It weirds me out that some
people who're math-literate enough to know calculus 101 wouldn't instantly get
that to test if two groups differ you would regress with a dummy, or to test if
slopes between groups are different you would regress with an interaction
effect. I remember it feeling as natural as water-is-wet by the end of the
semester.

This must be what it felt like to learn multiplication with Roman numerals.
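
To make the "regress with a dummy" claim concrete, here is a small sketch in plain Python. The two groups are made-up numbers, and the comparison assumes the equal-variance (Student) two-sample t-test rather than Welch's version. It computes the t statistic the textbook way and again as the slope of a regression on a 0/1 dummy.

```python
from math import sqrt

# Hypothetical measurements for two groups (made-up numbers).
group0 = [4.1, 5.0, 3.8, 4.6, 5.2]
group1 = [5.9, 6.3, 5.1, 6.8, 6.0, 5.5]


def mean(v):
    return sum(v) / len(v)


# Classical pooled-variance two-sample t statistic.
n0, n1 = len(group0), len(group1)
m0, m1 = mean(group0), mean(group1)
ss0 = sum((v - m0) ** 2 for v in group0)
ss1 = sum((v - m1) ** 2 for v in group1)
pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
t_classic = (m1 - m0) / sqrt(pooled_var * (1 / n0 + 1 / n1))

# Same question as a regression: y = b + c * dummy.
y = group0 + group1
d = [0.0] * n0 + [1.0] * n1           # dummy: 1 if the row is in group1
n = len(y)
d_bar, y_bar = mean(d), mean(y)
sxx = sum((di - d_bar) ** 2 for di in d)
sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
c = sxy / sxx                          # slope on the dummy
b = y_bar - c * d_bar                  # intercept
rss = sum((yi - (b + c * di)) ** 2 for di, yi in zip(d, y))
se_c = sqrt(rss / (n - 2) / sxx)       # standard error of the slope
t_regression = c / se_c

print(f"difference in means: {m1 - m0:.4f}, dummy coefficient: {c:.4f}")
print(f"t (classic): {t_classic:.4f}, t (regression): {t_regression:.4f}")
```

The intercept equals the mean of the first group, the dummy coefficient equals the difference in means, and the two t statistics agree up to floating-point error.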

~~~
pas
What does it mean to "regress with a dummy"? (Also what does regress with mean
in general?) And two groups of what? Points? Also, what is an interaction
effect? :o

~~~
tomrod
Jargon only.

Regress with a dummy means finding a set of coefficients which minimize the sum
of squared errors between a column (or columns) of a dependent variable and
linear combinations of independent variables. The "dummy" is a 1/0 flag
identifying whether the observation row belongs to a category, e.g. gender
(assuming a binary classification) with Female mapped to 1. The coefficient
that results from this algorithm will be the effect difference for a Female,
and the intercept will represent the Male group. Simple model: dependent
variable is wage, independent variable is years worked, and dummy represents
female. The model is then

Wage = b + m * years worked + c * 1(gender = F).

The intercept term b will be the average male wage with 0 years worked. b + c
(the intercept plus the coefficient on the female dummy) will be the average
female wage for 0 years worked.

With the regression, you get some nice things out like confidence intervals
around your coefficient estimates and more. So you can test hypotheses like
whether there is a wage gap (test c = 0) or how large the wage gap might be
(c = any number).

Obviously a canned example, but this helps to introduce all but your last
question. You can also test whether there is a slope effect. That is called an
interaction term: you add an independent variable to the model that multiplies
the gender-coded variable by years worked. This would then show, if
significant, that you can rule out the hypothesis that each additional year
worked raises wages by the same amount for males and females (i.e. not just
two parallel lines, but different slopes entirely).
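
A rough sketch of the wage example with the interaction term added, in plain Python. The numbers are invented and noise-free, and the name k for the interaction coefficient is my own label, not part of the model above. It relies on one fact: the model Wage = b + m * years + c * female + k * years * female gives the same fit as two separate straight lines, one per group, so c and k can be read off as the differences in intercept and slope.

```python
def fit_line(x, y):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up, noise-free rows: (years worked, female flag, wage).
# Generated from wage = 30 + 2.0*years - 4*female - 0.5*years*female.
rows = [(yrs, f, 30 + 2.0 * yrs - 4 * f - 0.5 * yrs * f)
        for f in (0, 1) for yrs in (0, 2, 5, 10, 20)]

male = [(yrs, w) for yrs, f, w in rows if f == 0]
female = [(yrs, w) for yrs, f, w in rows if f == 1]

b, m = fit_line(*zip(*male))           # male intercept and slope
bf, mf = fit_line(*zip(*female))       # female intercept and slope

c = bf - b    # coefficient on the dummy: gap at 0 years worked
k = mf - m    # coefficient on years*female: difference in slopes
print(f"b={b:.2f} m={m:.2f} c={c:.2f} k={k:.2f}")
```

With real, noisy data you would fit all four coefficients in one regression so that you also get standard errors for c and k. Testing k = 0 is the "are the slopes different" test.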

~~~
thanatropism
Great reply. Thanks for spreading knowledge.

------
nerdponx
Yes, a thousand times. The way ANOVA is taught is especially egregious.

~~~
oarabbus_
Why is this, and what is the proper way to learn it?

------
lenticular
The zoo of frequentist statistical tests in the article tends to give results
that are easily interpreted in misleading or wrong ways (but kudos to the
author for prominently including Bayesian methods). Uncritical usage of these
techniques is the #1 reason behind the replication crisis in science.

Bayesian methods should be preferred as the default.

~~~
Xcelerate
Could you elaborate a bit? How do Bayesian methods avoid misinterpretation?

~~~
pps43
If 10 people live in a village and none are diagnosed with cancer, then cancer
incidence in that village is 0%.

~~~
eanzenberg
Not sure if joking..

~~~
pps43
Well, events over trials is the frequentist definition. Unbiased estimate with
nice statistical properties.

You know that it's an underestimate because you have prior knowledge about
cancer incidence. Bayesian methods let you incorporate that knowledge into
the estimation process, pulling the estimate up towards a more realistic value.
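
One way to sketch the village example, assuming a conjugate Beta prior on the incidence rate. The Beta(1, 99) prior (mean 1%) is a number I made up for illustration, not a real cancer statistic.

```python
# Village example: 0 cancer diagnoses among 10 residents.
events, trials = 0, 10

# Frequentist point estimate: events over trials.
freq_estimate = events / trials

# Bayesian estimate with a Beta(a, b) prior on the incidence rate.
# ASSUMPTION: Beta(1, 99) is a made-up prior with mean 1%, chosen only
# for illustration; it is not a real cancer incidence figure.
a, b = 1.0, 99.0
post_a = a + events                  # conjugate update: add the events
post_b = b + (trials - events)       # and the non-events
posterior_mean = post_a / (post_a + post_b)

print(f"frequentist estimate: {freq_estimate:.4f}")
print(f"posterior mean:       {posterior_mean:.4f}")
```

The posterior mean is 1/110, a bit below the prior mean of 1% because ten cancer-free residents are weak evidence of a lower rate, but no longer exactly zero.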

~~~
mr_toad
> nice statistical properties.

Undefined variance isn’t nice.

------
chrisshroba
This seems like a great guide for how to teach stats! Does anyone have a
corresponding guide for how to learn stats this way? I have almost no
experience and would love to learn it in this “most tests are just linear
regressions” way that the author claims is so intuitive!

~~~
bunderbunder
I was rather fond of _Discovering Statistics using R_. It doesn't drive this
point about linear models home quite so lucidly, and the author's sense of
humor can be off-putting to some. But it generally does a pretty decent job of
explaining the logic and motivations behind different statistical models and
tests, and remains one of the more readable math textbooks I've ever trudged
through.

------
tenkabuto
This is very cool! I hope that with time someone will produce a version with
Python code examples (though there are now so many graphing libraries for Python
that assuredly not everyone would be pleased).

------
Myrmornis
This looks fantastic. Two ideas are

\- An automated test suite verifying the (near) equivalences.

\- Might be fun for someone to provide parallel implementations in other
linear modeling frameworks (scikit-learn, julia, etc).

------
MarkMyWordsMan
The site is down.

------
tomrod
Hey, I posted this last night[0]. Every time I think I know how HN works it
surprises me again :)

I absolutely loved this article, and am happy to see it get airtime on HN.

[0]
[https://news.ycombinator.com/item?id=19509107](https://news.ycombinator.com/item?id=19509107)

~~~
bibyte
I was also surprised to learn that HN allowed this. Even more surprising is
the fact that sometimes it doesn't allow re-posting posts that are months old.
I wish it was more clearly documented in the FAQ.

~~~
sctb
> _If a story has had significant attention in the last year or so, we kill
> reposts as duplicates. If not, a small number of reposts is ok._

[https://news.ycombinator.com/newsfaq.html](https://news.ycombinator.com/newsfaq.html)

This is what's going on. The previous submission didn't get any attention
(votes, discussion) so the repost was allowed after a certain time threshold
(about 8 hours).

~~~
tomrod
I appreciate the clarification.

