
Guide to Linear Regression (2015) - alexhwoods
http://alexhwoods.com/guide-to-linear-regression/
======
theophrastus
Any summary of linear regression ought to at least point out that the "cost
function", which depending on the circumstances is often viewed as modeling
the variance in the data, is typically limited to one data coordinate. That
is, the separations to the fit line to be minimized go entirely along the
y-axis; which has the effect of assuming that there's perfect knowledge of the
x value. And while that can be the case for some sampling protocols, it is
also often not the case. So please consider including something like:

If there is uncertainty in both the x and y coordinates then one needs to
pursue alternate approaches which admit variation in both, one of the most
popular being "Deming regression"[1]

[1]
[https://en.wikipedia.org/wiki/Deming_regression](https://en.wikipedia.org/wiki/Deming_regression)
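
For the curious: once you fix the ratio of the two error variances, the Deming slope has a closed form (it's on the Wikipedia page). Here's a minimal numpy sketch; the function name and interface are mine, and delta = 1 gives the orthogonal-regression special case:

```python
import numpy as np

def deming(x, y, delta=1.0):
    """Deming regression: accounts for error in both x and y.
    delta = var(y errors) / var(x errors); delta=1 is orthogonal regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2) / (n - 1)
    syy = np.sum((y - ybar) ** 2) / (n - 1)
    sxy = np.sum((x - xbar) * (y - ybar)) / (n - 1)
    # closed-form slope from the sample (co)variances
    slope = (syy - delta * sxx +
             np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    return slope, ybar - slope * xbar
```

For data that really does lie on a line, this recovers the same fit as OLS; the two diverge once there's substantial scatter in x.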

~~~
ced
Can you give examples? It seems to me that in my cases, I never use
X='temperature' as a predictor, but rather X='temperature reported by
thermometer', and this is of course always known with 100% accuracy. In a
predictive context, that's all we need. Are you referring to a social science
context?

~~~
theophrastus
Here's an easy one to conceptualize: a good friend of mine is a state
hydrologist. He crawls over streams on a ladder-like platform and, at
increments, lowers a depth gauge to eventually estimate the stream's volume
flow rate. For a number of reasons (including wild weather conditions) the
cross-stream increment is known only to an inch or so; that's the uncertainty
in X. The depth gauge reading carries its own uncertainty in Y. These are fed
into a function whose slope, via Deming regression, gives him an estimate of
the flow rate. By the bye, most lab thermometers I've dealt with have an
accuracy of about 0.1°C.

------
larrydag
I believe Frank Harrell's Regression Modeling Strategies is one of the best
references for performing regression analysis.

website:
[http://biostat.mc.vanderbilt.edu/wiki/Main/RmS](http://biostat.mc.vanderbilt.edu/wiki/Main/RmS)
Amazon book: [http://amzn.to/1UdPWOv](http://amzn.to/1UdPWOv)

~~~
stdbrouw
Yep, there is really no other book like Harrell. There's a 2015 edition with
some good improvements too. Definitely not an introduction, though – it
assumes you already know how regression works and preferably have already used
it quite a bit in practice. For an introduction, I might recommend Downey's
Think Stats [1] as a gentle starting point, Kutner's Applied Linear
Statistical Models for a more traditional, mathy treatment, or Gelman and
Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models [2]
for something in-depth but still very approachable.

[1]
[http://greenteapress.com/thinkstats2/](http://greenteapress.com/thinkstats2/)

[2]
[http://www.stat.columbia.edu/~gelman/arm/](http://www.stat.columbia.edu/~gelman/arm/)

~~~
larrydag
Great points. It's not an intro-to-regression book, although Harrell's
examples are very well laid out and easy to follow.

------
peatmoss
Yikes, no discussion of how one might interpret this model, what its
assumptions are, and how to assess validity or fit? Look elsewhere--this is
not the guide to linear regression anyone should be looking for.

~~~
glaugh
We wrote a couple docs about regression. I'd be curious to get folks'
feedback.

Guide itself: [http://docs.statwing.com/user-friendly-guide-to-
regression/](http://docs.statwing.com/user-friendly-guide-to-regression/)

Perhaps more interestingly to a lot of folks in this crowd, a guide to
interpreting residuals: [http://docs.statwing.com/interpreting-residual-plots-
to-impr...](http://docs.statwing.com/interpreting-residual-plots-to-improve-
your-regression/)

One valid critique is that the approach we describe is not super rigorous:
there's more "add a variable and see if it sticks" and "explore a bunch of
bivariate relationships to decide what to include" than you'd want if you were
publishing a paper on your findings. To our minds, though, this approach is
more realistic for the use cases we care about most, like analyzing an ad hoc
survey. You can always consider your results exploratory and validate them
later. (Note that these docs occasionally refer to Statwing, our product, but
really could be used with any tool.)

Criticism is welcome.

~~~
fats_tromino
Just two quick comments. First, confidence intervals != prediction intervals:
one gives a range around the true mean value, the other a range within which a
new observation may actually fall (prediction intervals are wider).

You may also want to mention adjusted R^2 as a measure of model quality. I've
never heard of AICR before; the standard metrics for model quality are AIC/BIC
(and sometimes Mallows's Cp).

Edit: fixed sloppy wording about confidence interval.
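
To make the distinction concrete, here's a rough numpy/scipy sketch for simple regression using the textbook formulas (the helper name and interface are made up for illustration):

```python
import numpy as np
from scipy import stats

def intervals(x, y, x0, alpha=0.05):
    """Confidence interval (for the mean of y at x0) and prediction
    interval (for a single new observation at x0) from simple OLS."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    slope = np.sum((x - xbar) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * xbar
    resid = y - (intercept + slope * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))       # residual standard error
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    yhat = intercept + slope * x0
    se_mean = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)       # CI half-width basis
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)   # PI adds the "+1"
    ci = (yhat - t * se_mean, yhat + t * se_mean)
    pi = (yhat - t * se_pred, yhat + t * se_pred)
    return ci, pi
```

The extra "1" under the square root is exactly the difference between the two: the prediction interval also has to absorb the noise of a single new observation, so it's always wider.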

~~~
glaugh
The prediction interval thing was sloppy, good catch.

The R^2 and AICR comments reflect the fact that these docs were built around
Statwing's regression capabilities, which default to M-estimation instead of
OLS. There's no adjusted R^2 for that, which I agree is a better measure when
available, and it uses AICR, where the R stands for "robust". But still a good
catch, since I didn't caveat up front that we were only talking about robust
methods (and we don't note it in the docs).

Much appreciated.

~~~
hooloovoo_zoo
OLS is an M-estimator.

------
bigger_cheese
Multivariate statistics is something I really wish were covered better.

In my work as an engineer I use multiple linear regression from time to time.
The best method I've found is a forward-selection stepwise approach. I've
never really seen a simple explanation of how it works (it was introduced to
me by a statistician at my work), but it is useful when you have many
variables and you want to see how significant a single variable is relative to
the overall regression. The impression I get is that pure stats people really
dislike stepwise modelling.
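
For what it's worth, the mechanics are simple to sketch: start from an intercept-only model and greedily add whichever variable most improves some criterion, stopping when nothing helps. Here's a rough numpy version using AIC as that criterion (one common choice; other implementations use F-tests or p-value thresholds, and all the names here are mine):

```python
import numpy as np

def aic(y, X):
    """AIC of an OLS fit (Gaussian likelihood, up to a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y):
    """Greedy forward selection: repeatedly add the column of X that most
    lowers AIC; stop when no addition helps."""
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    design = np.ones((n, 1))                      # intercept-only start
    best = aic(y, design)
    while remaining:
        scores = [(aic(y, np.hstack([design, X[:, [j]]])), j) for j in remaining]
        score, j = min(scores)
        if score >= best:                         # nothing improves the fit enough
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
        design = np.hstack([design, X[:, [j]]])
    return chosen
```

The "pure stats people dislike it" point is partly because this greedy search invalidates the usual p-values and tends to overfit; it's still a handy screening tool.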

More recently I've been looking into PCA (Principal Component Analysis) and
PLS (Projection to Latent Structures) after coming across them in a PhD
thesis. I've yet to find a decent simple explanation of how they work, though.
Unfortunately my workplace no longer employs a statistician.
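
For anyone else looking: the core of PCA is just "center the data, take an SVD, and keep the top singular vectors as the new axes". A minimal numpy sketch (naming is my own):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD: center the data, take the top-k right singular vectors
    as components; scores are the data projected onto those components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # (k, n_features) directions
    scores = Xc @ components.T               # (n_samples, k) coordinates
    explained_var = s[:k] ** 2 / (len(X) - 1)
    return scores, components, explained_var
```

PLS is the regression-flavored cousin: instead of directions of maximum variance in X alone, it looks for directions in X that best covary with the response.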

------
hackaflocka
IMHO, a very confusing thing about Linear Regression is that there's
widespread disagreement about whether the data are supposed to be Normally
Distributed or not, and on how to measure said Normality.

~~~
stdbrouw
You're right, there is lots of confusion on the topic and not all texts on
regression get it right. Theoretically, linear regression depends only on
normally distributed errors (equivalently, an outcome that is normal
conditional on the predictors); the predictors themselves don't have to be
normally distributed. Practically speaking, even that requirement can mostly
be ignored: OLS is fairly insensitive to non-normal errors. Violations of the
assumptions of regression usually don't affect the coefficient estimates much;
they mainly affect the uncertainty around those estimates -- the standard
errors. That too is easily addressed: use bootstrapping to calculate the
standard errors.

In short: you can't just throw linear regression at anything and expect to get
reasonable results, but it's pretty damn close.
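
The pairs (case-resampling) bootstrap is only a few lines, e.g. for the slope in simple regression (a sketch; the function name and defaults are mine):

```python
import numpy as np

def boot_se(x, y, n_boot=2000, seed=0):
    """Case-resampling bootstrap standard error for the OLS slope:
    resample (x, y) pairs with replacement, refit, take the std of slopes."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample rows with replacement
        xb, yb = x[idx], y[idx]
        sxx = np.sum((xb - xb.mean()) ** 2)
        slopes[b] = np.sum((xb - xb.mean()) * (yb - yb.mean())) / sxx
    return slopes.std(ddof=1)
```

On well-behaved data this agrees closely with the textbook standard error; its advantage is that it keeps working when the error assumptions don't hold.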

~~~
hooloovoo_zoo
Even that normality assumption isn't required; see, for instance,
[https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem).

~~~
stdbrouw
Good catch. The outcome (or the error, which boils down to the same thing)
can't just be anything either, though. A common pattern is that the variance
of the outcome increases as the predictor increases (e.g. a machine behaves
more erratically when it spins faster); that heteroscedasticity leaves the
coefficient estimates unbiased but makes them less efficient and can throw off
the standard errors.

