
Linear regression - ingve
http://eli.thegreenplace.net/2016/linear-regression/
======
imurray
If you're new to learning about regression and want to use it, I would
prioritize the statistical issues. There are lots of computational details you
could read about, like in this post. However, in simple cases, fitting least-
squares linear regression is one line of code:

    
    
        w_fit = X \ yy; % Matlab/Octave, X is NxD, yy is Nx1
        w_fit = np.linalg.lstsq(X, yy, rcond=None)[0] # NumPy equivalent
    

Knowing how to construct the "design matrix" X is the important bit. What are
your features, how will they be pre-processed, and given what you've done,
should you trust w_fit to be meaningful, or just useful for prediction?

If you've thrown in lots of features, hoping to get good predictions, you may
want to do "regularization". The computation is no harder for simple versions.
One way to do "L2 regularization" is to add some extra rows to X and yy before
fitting the weights:

    
    
        % Matlab/Octave... the NumPy is nearly as simple.
        [N, D] = size(X);
        lambda = 1; % regularization constant
        X = [X; eye(D)*sqrt(lambda)];
        yy = [yy; zeros(D,1)];
    
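For concreteness, the NumPy version of the same row-augmentation trick (a
minimal sketch; X and yy are the arrays from the first snippet):

    
    
        import numpy as np
        
        N, D = X.shape
        lam = 1.0  # regularization constant
        # Appending sqrt(lam)*I rows to X and zeros to yy means plain
        # least squares on the augmented system minimizes
        # ||X w - yy||^2 + lam * ||w||^2.
        X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
        yy_aug = np.concatenate([yy, np.zeros(D)])
        w_fit = np.linalg.lstsq(X_aug, yy_aug, rcond=None)[0]
    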

Or you could use an actual stats package like R's libraries that will contain
a lot more care. Again, you need to understand the statistics, for example how
you should select lambda. Then you'll want to know how to diagnose your model
and whether you should trust it.
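
For example, a simple (if crude) way to choose lambda is held-out validation
error over a grid; a minimal NumPy sketch standing in for proper
cross-validation (hypothetical grid values; assumes the un-augmented X and yy
from the first snippet):

    
    
        import numpy as np
        
        def fit_l2(X, yy, lam):
            # L2-regularized least squares via the row-augmentation trick.
            D = X.shape[1]
            X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
            yy_aug = np.concatenate([yy, np.zeros(D)])
            return np.linalg.lstsq(X_aug, yy_aug, rcond=None)[0]
        
        # Hold out 20% of the data and keep the lambda with the lowest
        # held-out squared error.
        rng = np.random.default_rng(0)
        idx = rng.permutation(len(yy))
        cut = int(0.8 * len(yy))
        tr, va = idx[:cut], idx[cut:]
        lams = [0.01, 0.1, 1.0, 10.0, 100.0]
        errs = [np.mean((X[va] @ fit_l2(X[tr], yy[tr], lam) - yy[va])**2)
                for lam in lams]
        best_lam = lams[int(np.argmin(errs))]
    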

It's fun to work out the least squares maths and implement these things from
scratch. But it has been done, and it's not what matters for most
applications.

EDIT: a discussion with good links on linear models and beyond from the other
day
[https://news.ycombinator.com/item?id=12237998](https://news.ycombinator.com/item?id=12237998)

~~~
jimmyerf
Econometrician here.

I really hate that linear regression is being explained in "Machine Learning"
language. "Features", "design matrix", "predictions".

~~~
cossatot
And many in the sciences are confused by 'endogenous' and 'exogenous', but
just say 'independent' and 'dependent'... Every field has its jargon and
econometrics has no primacy here; no reason to have a strong emotional
reaction to words you understand because they're not what you learned in
college.

------
thanatropism
The single most important fact about linear regression is the Frisch-Waugh-
Lovell theorem. It's the whole reason why we're not scared as shit of using
linear models.

[https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93L...](https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem)

Basically, it means that omitting variables is ok (subject to terms and
conditions).

Assuming that IQ and shoe size are uncorrelated, but that income is a function
of IQ and shoe size, I can run a regression:

income = a + b * IQ + residuals

and trust my estimate of b. FWL theorem says that running that regression and
then

residuals = c + d * shoe size

is the same as running

income = (a+c) + b * IQ + d * shoe size + eps

It's an amazing result and it's all based on linear algebra, no belief in
statistics required.
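
A quick numerical sanity check of that claim (a sketch with made-up data,
forcing the regressors to be exactly uncorrelated in-sample so the theorem
holds exactly):

    
    
        import numpy as np
        
        rng = np.random.default_rng(1)
        n = 1000
        iq = rng.normal(100, 15, n)
        shoe = rng.normal(42, 3, n)
        # Make shoe size exactly orthogonal to [1, iq] in this sample.
        A = np.column_stack([np.ones(n), iq])
        shoe -= A @ np.linalg.lstsq(A, shoe, rcond=None)[0]
        income = 10 + 2.0 * iq + 0.5 * shoe + rng.normal(0, 1, n)
        
        # Two-step: income on iq, then the residuals on shoe size.
        coef = np.linalg.lstsq(A, income, rcond=None)[0]
        b = coef[1]
        resid = income - A @ coef
        S = np.column_stack([np.ones(n), shoe])
        d_step = np.linalg.lstsq(S, resid, rcond=None)[0][1]
        
        # Joint: income on both regressors at once.
        X = np.column_stack([np.ones(n), iq, shoe])
        _, b_joint, d_joint = np.linalg.lstsq(X, income, rcond=None)[0]
        print(b - b_joint, d_step - d_joint)  # both ~0 (to rounding)
    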

~~~
apathy
Eh, orthogonalizing variables is a necessary precondition for this. When was
the last time variables were totally uncorrelated and you didn't know it
beforehand? Unfortunately such cases are the exception rather than the rule,
and you lose interpretability by projecting into a low-rank orthogonal
subspace. Not that interpretability is necessarily the be-all and end-all, but
the "terms and conditions" are pretty important here!

~~~
thanatropism
> When was the last time variables were totally uncorrelated and you didn't
> know it beforehand

Whenever we have an actual theory of the underlying phenomena.

It's common in econometrics to impose orthogonality conditions on the
theoretical equations so structural parameters can be recovered; or so we can
filter out partial correlations. For example: a regression of prices on
quantities sold will mix the effects of shifts in demand and shifts in supply.
Solution: find something that correlates with demand but not directly with
price (for example, changes in consumers' wealth).

A realistic, much cooler example:

[http://homepage.univie.ac.at/robert.kunst/pan2011_pres_rabas...](http://homepage.univie.ac.at/robert.kunst/pan2011_pres_rabas.pdf)

~~~
smallnamespace
> no belief in statistics required

Yes, but now you have a statistical assumption: that your structural model
holds.

To make use of the Frisch-Waugh-Lovell theorem, you must test that your
orthogonality assumption holds using statistics :)

------
theophrastus
There is one conceptually simple issue often missing from such nicely presented
write-ups (and which appears to be missing here): error in the abscissa
('x-axis') values. Time series tend to dominate such analytics, and there it's
typically assumed that the time-stamp data is accurate enough for any error to
be neglected. But many other data sources have notable error in both the 'x'
and 'y' data, which commonly employed linear regression doesn't allow for
(quick example: my friend the hydrologist collects flow rates in rivers at
sampled transverse distances, which are hard to be sure of while dangling above
the raging waters). As a respectable starting point for regression that allows
for error in both axes, I'd recommend Deming regression:
[https://en.wikipedia.org/wiki/Deming_regression](https://en.wikipedia.org/wiki/Deming_regression)
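
For anyone who wants to try it, the closed form on that page is short; a
minimal sketch (delta is the assumed, known ratio of the y-error variance to
the x-error variance; delta=1 gives orthogonal regression):

    
    
        import numpy as np
        
        def deming_fit(x, y, delta=1.0):
            # Closed-form Deming regression: allows for error in both
            # x and y, unlike ordinary least squares.
            xbar, ybar = x.mean(), y.mean()
            sxx = np.sum((x - xbar)**2) / (len(x) - 1)
            syy = np.sum((y - ybar)**2) / (len(x) - 1)
            sxy = np.sum((x - xbar) * (y - ybar)) / (len(x) - 1)
            slope = (syy - delta * sxx
                     + np.sqrt((syy - delta * sxx)**2
                               + 4 * delta * sxy**2)) / (2 * sxy)
            return ybar - slope * xbar, slope  # intercept, slope
    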

~~~
geezerjay
Those cases are also handled within the least-squares framework, mainly by
variants of total least squares regression.

[https://en.wikipedia.org/wiki/Total_least_squares](https://en.wikipedia.org/wiki/Total_least_squares)

------
capkutay
Anecdotal thought: I find practical guides and tutorials like this far more
useful for learning machine learning than any of the public MOOCs, like Andrew
Ng's machine learning class. There's a lot of dense content shown in video
lectures and very little practice. I almost had to google every topic and find
a write-up like this to drive each concept home.

------
darkhorn
Why are you not testing for independence of errors?
[http://people.duke.edu/~rnau/testing.htm](http://people.duke.edu/~rnau/testing.htm)
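
For anyone who wants to run the simplest such check, the Durbin-Watson
statistic is a one-liner; a sketch assuming resid is a 1-D NumPy array of
residuals in observation order:

    
    
        import numpy as np
        
        # Durbin-Watson: squared successive differences over squared
        # residuals. Roughly 2 if residuals are uncorrelated; toward 0
        # suggests positive serial correlation, toward 4 negative.
        dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)
    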

------
murbard2
A common gotcha with linear regression. Imagine that your data points roughly
fit the shape of a tilted ellipse. The linear regression will _not_ run along
the major axis of the ellipse; the slope will typically be smaller. This is
because you're minimizing the squared distance along the y axis, not the
distance from each point to the line itself.

~~~
thanatropism
What you appear to want is "total least squares":

[https://en.wikipedia.org/wiki/Total_least_squares](https://en.wikipedia.org/wiki/Total_least_squares)
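
To make the contrast concrete, a small sketch with made-up elliptical data,
comparing the ordinary least-squares slope against the total least-squares
slope obtained from the SVD of the centered data:

    
    
        import numpy as np
        
        rng = np.random.default_rng(2)
        n = 2000
        # A tilted elliptical cloud whose major axis has slope 1.
        t = rng.normal(0, 3, n)
        x = t + rng.normal(0, 1, n)
        y = t + rng.normal(0, 1, n)
        
        # OLS minimizes vertical distances: slope comes out ~0.9.
        A = np.column_stack([np.ones(n), x])
        ols_slope = np.linalg.lstsq(A, y, rcond=None)[0][1]
        
        # TLS minimizes perpendicular distances: the leading right
        # singular vector of the centered data is the major axis.
        Z = np.column_stack([x - x.mean(), y - y.mean()])
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        tls_slope = Vt[0, 1] / Vt[0, 0]
        
        print(ols_slope, tls_slope)  # ~0.9 vs ~1.0
    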

------
darkhorn
Why are you not testing for constant variance?
[https://onlinecourses.science.psu.edu/stat501/node/367](https://onlinecourses.science.psu.edu/stat501/node/367)

~~~
apathy
Because, to quote Box,

‘To make the preliminary test on variances is rather like putting to sea in a
rowing boat to find out whether conditions are sufficiently calm for an ocean
liner to leave port!’ (Box 1953)

Plot the residuals instead.
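
A minimal sketch of doing exactly that (assuming matplotlib, with y_hat the
model's fitted values and yy the observed targets):

    
    
        import matplotlib.pyplot as plt
        
        # Residuals vs fitted values: look for fanning out (non-constant
        # variance) or curvature (misspecification) by eye.
        resid = yy - y_hat
        plt.scatter(y_hat, resid, s=5)
        plt.axhline(0, color='k', linewidth=1)
        plt.xlabel('fitted value')
        plt.ylabel('residual')
        plt.show()
    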

------
apathy
Ctrl-F "Gauss"

 _0 matches_

Ctrl-F "Markov"

 _0 matches_

The fuck? Only the single most beautiful result in the field. Please don't
misunderstand me: the presentation is great and quite practical. I just wish
people would remember that very smart individuals have been coming along for
pretty much the entirety of humanity's existence.

[https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem)

The proof is hinted at in the writeup, so why not link to it?

~~~
darkhorn
I don't understand why they downvote you. Probably because they think that
they know statistics. Programmers at Hacker News always think that they are
masterminds of statistics. In fact they know almost nothing about statistics.
For example, if you don't test for constant variance and independence of
errors then you cannot graduate from a Department of Statistics. But here they
downvote you for knowing real statistics.

~~~
minimaxir
> But here they downvote you for knowing real statistics.

That is quite literally a No True Scotsman argument.

GP was likely downvoted since the Ctrl-F meme adds nothing to the discussion.

~~~
apathy
Personally, I felt that the omission was rather important. When the first
year's worth of many statistics programs can be condensed into one beautiful
theorem, which is centuries old, I feel it's disrespectful to one's forebears
to ignore it.

However, I get the impression that once the meme sensor goes off, any links
remain unvisited and reading comprehension may suffer. So I learned something
here, too.

I'd edit the original to remove the meme, but since I can't do that at this
point, readers will have to try and control their Internet Police tendencies
and read a few sentences.

Or not. Rather a shame; hopefully the author of the piece will fix it, and
this will all be superfluous.

~~~
eliben
[post author here]

FWIW I did not find your original comment particularly offensive, and didn't
downvote (I think years on the internet have made me resilient).

The Gauss-Markov theorem is very cool. However, as I discovered while doing
some reading for this article - whole _books_ were written on linear
regression. The amount of information to convey is almost limitless and I
didn't want to turn this one into a multi-part article. I simply wanted to
write down (code and math) the things I find most pertinent to understanding
how the regression works.

~~~
apathy
Fair enough! If you felt like adding it as a footnote, great, if it doesn't
fit the purpose, that's fine too. And to reiterate, it's a beautifully laid
out, concise, high-content piece of writing. I should have phrased it
differently; how it's said may matter more than what's said.

Essentially I wrote a reaction post but, given the amount of revising and
reanalysis/examination-of-assumptions I've been doing lately, I ought to have
said something like:

"This piece is an excellent tutorial on how to do linear regression, why
gradient descent is fundamental (far beyond OLS or fixed effects models), and
so forth. However, if you are interested in why all of these things are
possible in the multivariate case as well as the simple bivariate case, you
might find the Gauss-Markov theorem to be a 'light bulb' moment." Because
that's how it happened to me.

Cheers, and thanks for writing a great piece that I will be pointing others
towards.

~~~
eliben
Thanks for the feedback and the useful reference. I'll definitely add the
Gauss-Markov theorem to my to-look-in-depth-at-later list and may write up
about it at some future point.

