Common statistical tests are linear models (lindeloev.github.io)
331 points by homarp on March 28, 2019 | hide | past | favorite | 44 comments

"Unfortunately, stats intro courses are usually taught as if each test is an independent tool, needlessly making life more complicated for students and teachers alike."

Sadly enough this was true for so many other parts of my education.

Shout out to Dr. Jerry Reiter at Duke who emphasized this in every stats class I took with him!

This really ought to be better known. I think a large part of the reason that it isn't is because most books that cover linear models in general require a background in linear algebra, and there are very few people teaching from that standpoint outside of the advanced undergraduate/beginning graduate level.

Another thing that I wish was more widely known is that a linear model is linear in its parameters, not the data. You can apply arbitrary transformations to the data and still have a linear model as long as what you're fitting is of the form Ey = \beta_0 + \beta_1 f_1(X) + \beta_2 f_2(X) + ... + \beta_p f_p(X).

The idea that linear models are linear in the parameters and not the data is a bit confusing. I know the effect of this is that you fit curves with "linear" models, but I don't feel like I fully understand this. Can you explain further or link to some good resources?

Each data point is a bunch of features x_1, x_2, ..., x_n. You can make new features for your data points using whatever functions you like -- it doesn't matter if they're linear. Let's say we add two new features x_{n+1} = f(x_1, x_2) and x_{n+2} = g(x_2, x_3).

Now if we train a linear model on the new expanded set of features, it's linear in those features. It's not linear in the original data though, because of the new features that we introduced: x_{n+1} and x_{n+2}.
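A minimal sketch of this point in plain Python (made-up noiseless data, no libraries): the curve y = 2 + 3x² is nonlinear in x, but after transforming the feature to z = x² it is an ordinary linear regression in the parameters (b0, b1).

```python
# Fit the curve y = 2 + 3*x^2 with *linear* least squares, by regressing
# y on the transformed feature z = x^2. The model is linear in (b0, b1)
# even though it traces a parabola in x.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x**2 for x in xs]   # noiseless toy data for clarity

zs = [x**2 for x in xs]               # the nonlinear feature transform

z_bar = sum(zs) / len(zs)
y_bar = sum(ys) / len(ys)

# Ordinary least squares for simple regression of y on z
b1 = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) \
     / sum((z - z_bar) ** 2 for z in zs)
b0 = y_bar - b1 * z_bar

print(b0, b1)  # recovers 2.0 and 3.0
```

The fitting machinery never sees x, only z; "linear model" refers to the linearity in b0 and b1.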

I'm really interested in this, but wish someone wrote a matching "how to learn stats". This seems aimed at people who already have stats knowledge and experience with R, so there's lots of "you know what I'm talking about" assumptions.

I recommend The Cartoon Guide to Statistics as a good way to get into stats. I've taught a class using it at work a number of times, and it makes a great first exposure. The equations are there, explained in a really nice way.


That seems like it's _always_ how it is with stats... I find it very easy to get lost in stats presentations and articles with my STAT100-esque-understanding of simple random vars/distributions/PMF+PDFs

When I was learning undergraduate mathematical statistics, my professor made it a point to connect the links between linear models and many different tests.

In the back of Statistical Inference by Casella and Berger, there is a great chart making similar connections between the most common statistical distributions.

There is a lot of rhyming in statistics.

A version of that great chart mentioned in the comment above can be found on page 3 of this pdf: http://www.stat.rice.edu/~dobelman/courses/texts/leemis.dist...

This is awesome. Thanks for sharing.

A very good intro. Machine learning etc. is fantastic and what can be done with newer techniques blows me away. But I think the hype obscures the results that can be obtained much more easily with more basic methods, as a precursor to those advanced methods.

In econometrics training you learn to use the classical linear model and never even learn the names for the parametric tests. It weirds me out that some people who're math-literate enough to know calculus 101 wouldn't instantly get that to test for a difference between two groups you would regress with a dummy, or to test whether slopes differ between groups you would regress with an interaction effect. I remember it feeling as natural as water-is-wet by the end of the semester.

This must be what it felt like to learn multiplication with Roman numerals.

What does it mean to "regress with a dummy"? (Also what does regress with mean in general?) And two groups of what? Points? Also, what is an interaction effect? :o

Jargon only.

Regress with a dummy means find the set of coefficients that minimizes the sum of squared errors between a dependent variable and a linear combination of independent variables. The "dummy" is a 1/0 flag identifying whether the observation row belongs to a category, e.g. gender (assuming a binary classification) with Female mapped to 1. The coefficient on the dummy will be the effect difference for a Female, and the intercept will represent the Male group. Simple model: the dependent variable is wage, the independent variable is years worked, and the dummy represents female. The model is then

Wage = b + m * years worked + c * 1(gender = F).

The intercept term b will be the average male wage with 0 years worked. Adding c, the coefficient for female, gives b + c, the average female wage with 0 years worked.

With the regression, you get some nice things out like confidence intervals around your coefficient estimates and more. So you can test hypotheses like whether there is a wage gap at all (c = 0) or how large it might be (c equal to any particular value).

Obviously a canned example, but this helps to introduce all but your last question. You can also test whether there is a slope effect. That is called an interaction term: you add an independent variable to the model that multiplies the gender dummy by years worked. If that term is significant, you can reject the hypothesis that wages grow with years worked at the same rate for males and females (i.e. not just two parallel lines, but different slopes entirely).
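A quick sketch of the dummy part in plain Python (invented toy wages, years worked dropped for brevity): OLS on a 0/1 dummy alone gives intercept = mean of the group coded 0 and slope = difference in group means, which is exactly what a two-sample t-test compares.

```python
# Regress wage on a 0/1 gender dummy (toy numbers). The OLS intercept
# reproduces the mean of the group coded 0, and the slope reproduces the
# difference in group means -- the two-sample t-test setup.

wages   = [10.0, 12.0, 14.0, 9.0, 11.0, 13.0]
dummies = [0, 0, 0, 1, 1, 1]          # 1 = female

d_bar = sum(dummies) / len(dummies)
w_bar = sum(wages) / len(wages)

# Simple OLS of wage on the dummy
slope = sum((d - d_bar) * (w - w_bar) for d, w in zip(dummies, wages)) \
        / sum((d - d_bar) ** 2 for d in dummies)
intercept = w_bar - slope * d_bar

male_mean   = sum(wages[:3]) / 3      # 12.0
female_mean = sum(wages[3:]) / 3      # 11.0

print(intercept, slope)  # 12.0 and -1.0: male mean, and female minus male
```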

Great reply. Thanks for spreading knowledge.

Yes, a thousand times. The way ANOVA is taught is especially egregious.
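One way to see it, sketched in plain Python with made-up data: one-way ANOVA is just a comparison of two linear models, a "full" model with one mean per group against a "reduced" model with a single grand mean. The F statistic is the scaled drop in residual sum of squares.

```python
# One-way ANOVA framed as nested linear-model comparison (toy data).

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
all_y = [y for g in groups for y in g]
n, k = len(all_y), len(groups)

grand_mean = sum(all_y) / n
ss_reduced = sum((y - grand_mean) ** 2 for y in all_y)  # grand-mean model

ss_full = 0.0                                           # group-means model
for g in groups:
    m = sum(g) / len(g)
    ss_full += sum((y - m) ** 2 for y in g)

# F = (drop in residual SS per extra parameter) / (residual MS of full model)
f_stat = ((ss_reduced - ss_full) / (k - 1)) / (ss_full / (n - k))
print(f_stat)  # 13.0 for this toy data
```

This is the same F you'd get from the usual between/within sums-of-squares recipe, but the model-comparison framing makes the link to regression explicit.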

Why is this, and what is the proper way to learn it?

The zoo of frequentist statistical tests in the article tend to give results that are easily interpreted in misleading or wrong ways (but kudos to the author to prominently include Bayesian methods). Uncritical usage of these techniques is the #1 reason behind the replication crisis in science.

Bayesian methods should be preferred as the default.

I agree that we should take a harder look at Bayesian methods, but I don't think that sweeping comments like this do much to advance the discussion. Any circumspect person will naturally be suspicious of something that is held up as a panacea.

It's much more constructive (and compelling) to call attention to specific problems with how frequentist methods get used in practice, and talk about how Bayesian methods can help with that problem.

For example, here's Andrew Gelman talking about multiple comparison bias: https://statmodeling.stat.columbia.edu/2016/08/22/bayesian-i...

Whether or not they should be preferred, there is massive institutional and knowledge-related inertia behind frequentist approaches. Rather than throw out the baby with the bathwater, throw out all legacy code and retrain millions of scientists in Bayesian, we should consistently urge better use of frequentist approaches.

To transition the next generation of scientists to Bayesian will require universities/stats depts/other depts to do so. Have you seen this to be the case right now? Because in my experience of a few different research settings, almost no one I know, of around 50+ researchers, except those passionate about statistics (very rare person indeed) is using Bayesian methodology.

How to improve Bayesian knowledge in the world? This has its own set of challenges, as Bayesian thinking is arguably more mathematically challenging for most numerophobic people. IMO, it will take multiple generations of effort to transition everyone over.

> ... We should consistently urge better use of frequentist approaches.

Many "frequentist" methods can be rephrased as heavily-simplified special cases of the Bayesian approach. Most easily, by assuming a flat prior distribution for the relevant parameters (which is basically an artifact of the parameterization you choose anyway) you can assert that any MLE-based approach will yield a correct Bayesian posterior mode, which in turn is an optimal Bayes-estimator, assuming a constant loss function.

The limits of this whole approach are fairly clear, of course - for one thing, Bayesian stats is generally based on working with the entire posterior distribution, not merely a point estimate of it - and there are good reasons for this. But the basic point stands, and many "tweaks" on the basic frequentist approach can in turn be justified in Bayesian terms. This is not to say that frequentist statistics is all that we'll ever need, but merely pointing out that this whole argument of "we should work with what we have, and focus on making sure that frequentist approaches are put to good use" actually sounds rather vacuous. It's correct in a very limited sense, and hardly something that Bayes proponents are unaware of!
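The flat-prior point is easy to check numerically in the binomial case (plain Python, toy data): with a flat Beta(1, 1) prior, the posterior is Beta(k + 1, n − k + 1), and its mode lands exactly on the MLE k/n.

```python
# MLE = posterior mode under a flat prior, for binomial data (toy numbers).

k, n = 7, 10                     # 7 successes in 10 trials

mle = k / n                      # frequentist maximum-likelihood estimate

# Posterior under a flat Beta(1, 1) prior is Beta(k + 1, n - k + 1);
# the mode of a Beta(a, b) with a, b > 1 is (a - 1) / (a + b - 2).
a, b = k + 1, n - k + 1
posterior_mode = (a - 1) / (a + b - 2)

print(mle, posterior_mode)  # both 0.7
```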

Frequentist stats just aren't that useful for experimental results because they don't tell folks what they need to know. Even ignoring that, researchers' understanding of even basic concepts like P-values is so bad that one might as well throw out everything, baby and all. In one survey [0], 94% of Spanish academic psychologists believed in some form of the inverse probability fallacy. A similar survey in Italy showed similar results.

Anecdotally, I can definitely say that the vast majority of American physical scientists I've talked to about this also fell prey to this. With the inverse probability fallacy, one attaches a Bayesian interpretation to P-values, so I don't think training scientists in Bayesian stats would be harder than getting them to use frequentist stats correctly (which is a minefield in comparison).

[0] https://www.researchgate.net/publication/280580018_Interpret...

> Uncritical usage of these techniques is the #1 reason behind the replication crisis in science.

Do you have any evidence for this? My gut feeling is representativeness is a much bigger problem. Like you do a study on people of the same age, race, and class during the same zeitgeist and then generalize to an eternal law.

The next year fashion has changed and the replication comes out all different.

Representativeness is an issue, but I think the most important are the multiple comparisons problem and the base-rate fallacy. It's commonly the case that a P<0.05 positive result has more than a 50% chance of being wrong. The replication studies I've seen mainly point to these factors being the issue. The fundamental thread is neglect of prior information.
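The base-rate arithmetic behind "P<0.05 but more than 50% chance of being wrong" fits in a few lines of plain Python (the prior and power numbers below are illustrative assumptions, not measurements):

```python
# Base-rate arithmetic: what fraction of "significant" results are true?
# Assume only 5% of tested hypotheses are true and studies have 50% power.

prior_true = 0.05   # assumed fraction of tested hypotheses that are true
power      = 0.50   # P(significant | true effect)
alpha      = 0.05   # P(significant | no effect)

true_positives  = prior_true * power
false_positives = (1 - prior_true) * alpha

ppv = true_positives / (true_positives + false_positives)
print(ppv)  # ~0.34: roughly 2 in 3 significant findings are false here
```

Lowering the prior or the power pushes the positive predictive value down further, which is the point about neglecting prior information.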

Could you elaborate a bit? How do Bayesian methods avoid misinterpretation?

By being explicit about the assumptions they make. It does not mean they are not biaised, but that you at least know what poison you chose.

Frequentists treat a parameter as a point estimate.

Bayesians treat a parameter as a distribution.

The point estimate is based on the sample space, whereas the parameter distribution is based on the parameter space.

I think learning both is good, and people who pit those two schools of statistics against each other are a bit too zealous. They're both tools; use whichever fits the problem and is easier at the time.

If 10 people live in a village and none are diagnosed with cancer, then cancer incidence in that village is 0%.

That doesn't mean that 0% is your best estimate of the future cancer rate. Even frequentists can look for better methods than the raw frequency.

Do we know if being diagnosed with cancer is the same as actually having cancer?

Not sure if joking..

Well, events over trials is the frequentist definition: an unbiased estimate with nice statistical properties.

You know that it's an underestimate because you have prior knowledge about cancer incidence. Bayesian methods let you incorporate that knowledge into estimation process, pulling the estimate up towards a more realistic value.
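A sketch of that update in plain Python (the prior below is an assumption chosen for illustration): a conjugate Beta-Binomial model with a Beta(1, 99) prior encoding a rough 1% prior incidence. Observing 0 cases in 10 people barely moves the estimate, instead of collapsing it to the raw frequency of 0%.

```python
# Conjugate Beta-Binomial update for the village example (assumed prior).

prior_a, prior_b = 1.0, 99.0          # Beta(1, 99) prior, mean 1/100
cases, people = 0, 10                 # observed: 0 of 10 diagnosed

post_a = prior_a + cases              # posterior is Beta(a + k, b + n - k)
post_b = prior_b + (people - cases)

raw_frequency  = cases / people                 # 0.0
posterior_mean = post_a / (post_a + post_b)     # 1/110, about 0.009

print(raw_frequency, posterior_mean)
```

Ten clean observations pull the estimate slightly below the 1% prior mean, but nowhere near zero, which is the "more realistic value" in question.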

> nice statistical properties.

Undefined variance isn’t nice.

This seems like a great guide for how to teach stats! Does anyone have a corresponding guide for how to learn stats this way? I have almost no experience and would love to learn it in this “most tests are just linear regressions” way that the author claims is so intuitive!

I was rather fond of Discovering Statistics using R. It doesn't drive this point about linear models home quite so lucidly, and the author's sense of humor can be off-putting to some. But it generally does a pretty decent job of explaining the logic and motivations behind different statistical models and tests, and remains one of the more readable math textbooks I've ever trudged through.

As with so much in statistics, there is a giant gap when it comes to material to actually learn the subject usefully.

This is very cool! I hope that with time someone will produce a version with Python code examples (though there are now so many graphing libraries for Python that assuredly not everyone would be pleased).

This looks fantastic. Two ideas:

- An automated test suite verifying the (near) equivalences.

- Might be fun for someone to provide parallel implementations in other linear modeling frameworks (scikit-learn, julia, etc).

The site is down.

Hey, I posted this last night[0]. Every time I think I know how HN works it surprises me again :)

I absolutely loved this article, and am happy to see it get airtime on HN.

[0] https://news.ycombinator.com/item?id=19509107

I was also surprised to learn that HN allowed this. Even more surprising is the fact that sometimes it doesn't allow re-posting posts that are months old. I wish it were more clearly documented in the FAQ.

> If a story has had significant attention in the last year or so, we kill reposts as duplicates. If not, a small number of reposts is ok.


This is what's going on. The previous submission didn't get any attention (votes, discussion) so the repost was allowed after a certain time threshold (about 8 hours).

I appreciate the clarification.
