
Linear regression by hand - tarheeljason
https://dsgazette.com/2018/01/10/linear-regression-by-hand/
======
thaumaturgy
MySQL is perfectly capable of calculating a linear regression for you, btw. In
my case, I needed to be able to estimate trends from sparse time series data.
Here's how you do that:

    -- note: reading @vars in the same SELECT that assigns them relies on
    -- left-to-right evaluation, which MySQL does not guarantee (and newer
    -- versions deprecate); it works in practice here
    SELECT
        -- population means of the two series
        @a_count := avg(`count`) as mean_count,
        @a_weeks := avg(`week`) as mean_weeks,
        -- population covariance: E[xy] - E[x]E[y]
        @covariance := (sum(`week` * `count`) - sum(`week`) * sum(`count`) / count(`week`)) / count(`week`) as covariance,
        @stddev_count := stddev(`count`) as stddev_count,
        @stddev_week := stddev(`week`) as stddev_week,
        -- Pearson correlation, then the usual slope/intercept formulas
        @r := @covariance / (@stddev_count * @stddev_week) as r,
        @slope := @r * @stddev_count / @stddev_week as slope,
        @y_int := @a_count - (@slope * @a_weeks) as y_int,
        -- extrapolate the trend to the current week, floored at 1
        @this_week_no := timestampdiff(WEEK, (select min(`date`) from dataset), curdate()) as this_week_no,
        @predicted := round(greatest(1, @y_int + (@slope * @this_week_no))) as predicted
    FROM (
        SELECT
            timestampdiff(WEEK, (select min(`date`) from dataset), `date`) as `week`,
            count(`date`) as `count`
        FROM dataset
        GROUP BY `week`
    ) series;

I had to figure out how to translate the math into SQL; now you don't have to.

This performs well enough to be able to crunch tens of millions of rows of
data in "reasonable time" on a wimpy VPS.
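For anyone who wants to sanity-check the arithmetic outside MySQL, here's a small sketch in Python/NumPy using the same formulas (the toy data and variable names are mine, not from the SQL above):

```python
import numpy as np

# toy series: week number vs. weekly count (made-up data)
week = np.array([0, 1, 2, 3, 4, 5], dtype=float)
count = np.array([3, 5, 4, 8, 7, 10], dtype=float)

# population statistics, matching MySQL's avg()/stddev()
mean_w, mean_c = week.mean(), count.mean()
cov = (week * count).mean() - mean_w * mean_c   # E[xy] - E[x]E[y]
r = cov / (week.std() * count.std())            # Pearson correlation
slope = r * count.std() / week.std()            # same as cov / var(week)
y_int = mean_c - slope * mean_w

this_week = 8
predicted = max(1, round(y_int + slope * this_week))
print(slope, y_int, predicted)   # slope ~1.29, intercept ~2.95, predicted 13
```

The slope and intercept agree with what `np.polyfit(week, count, 1)` returns, which is a handy cross-check.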

~~~
soVeryTired
Yes, but _why_? Right tool for the right job and all that...

~~~
barrkel
Normally the reason you write anything beyond trivial SQL is because you only
have a small amount of code to run and lots of data to run it over. Pushing
the code to the data is more efficient than pulling the data to the code.

The latter might be conceptually cleaner (though it's debatable, relational is
a fairly nice programming model and a lot more consistent and well-founded
than object orientation, for one), but it's seldom optimal.

Speedups of three orders of magnitude or more are not unusual when you push
the code to the data.

~~~
thaumaturgy
You're basically spot on here. I have a bunch of rows that need to have trend
data crunched and updated pretty frequently. Putting this into MySQL cut the
server load quite a bit, and it isn't so much business logic that I feel bad
about doing it.

------
Rainymood
Can someone explain to me why this has (so many) upvotes? This is like
elementary undergraduate econ stats and kind of trivial?

There's very little content either; it's literally a reformulation of the
formula, with no interesting graphs or geometric interpretation. What I
expected from a title like "Linear Regression By Hand" was the minimization of
some quadratic error function by hand (i.e. using pencil and paper).

~~~
theophrastus
One issue that I nearly always find missing in intro discussions about linear
regression is the near-universal assumption of no error in the abscissa ("x")
values. And while this is true-ish for time series data (we know for
_certain_ which day we collected the data on, but was it the same hour every
day?), I'd be rich if I had a nickel for every time I saw standard linear
regression done when the "x" had significant (and known) error. In that case
you're biasing yourself unless you use some sort of 2D regression, like
Deming.[1]

[1]
[https://en.wikipedia.org/wiki/Deming_regression](https://en.wikipedia.org/wiki/Deming_regression)
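For the curious, the Deming fit has a closed form too. A minimal sketch in Python/NumPy, translated from the formulas in the Wikipedia article linked above (the function name and toy data are mine):

```python
import numpy as np

def deming(x, y, delta=1.0):
    """Deming regression: allows for error in both x and y.
    delta = var(y-errors) / var(x-errors); delta=1 gives
    orthogonal regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.var(x)                                   # population variance of x
    syy = np.var(y)                                   # population variance of y
    sxy = np.mean((x - x.mean()) * (y - y.mean()))    # population covariance
    slope = (syy - delta * sxx
             + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)
             ) / (2 * sxy)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# on noise-free collinear data it recovers the line exactly
print(deming([0, 1, 2, 3], [1, 3, 5, 7]))   # slope 2.0, intercept 1.0
```

Note that you need to know (or assume) the error-variance ratio `delta` up front; with `delta=1` this reduces to total least squares.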

~~~
VHRanger
Regression with measurement error is usually treated in much higher level
statistics/econometrics classes.

If you're interested in this, you can read more in Mostly Harmless
Econometrics [1] about addressing this with IV methods.

[1]
[http://www.development.wne.uw.edu.pl/uploads/Main/recrut_econometrics.pdf](http://www.development.wne.uw.edu.pl/uploads/Main/recrut_econometrics.pdf)

~~~
pacbard
To build on this a little bit more, there are also generalized linear models
that allow you to specify the reliability (i.e., error level) of a variable.

Regarding 2SLS models, I find them more useful for accounting for endogeneity
in the model than for measurement error. After all, measurement error is
usually unobserved (otherwise you would just take it out). 2SLS “just”
reweighs the point estimates by identifying the good variation in the
instrumented variable using the instrument (for example, using draft lottery
results to account for the endogenous choice to attend college).

------
dankohn1
I remember being blown away as an undergrad that least squares (which I had
learned first algebraically) had such an obvious geometric meaning:

[http://www.statisticshowto.com/wp-content/uploads/2014/11/least-squares-regression-line.jpg](http://www.statisticshowto.com/wp-content/uploads/2014/11/least-squares-regression-line.jpg)

You need to square the values so that positive and negative differences
(between the points and the regression line) don't cancel out.
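A quick numerical illustration of that cancellation (toy data, NumPy; not from the linked figure):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 3., 2., 5., 4.])

slope, intercept = np.polyfit(x, y, 1)     # least-squares line
residuals = y - (slope * x + intercept)

# the signed residuals of the least-squares line sum to (numerically) zero,
# so the raw sum can't serve as an error measure; squaring fixes that
print(residuals.sum())          # ~0
print((residuals ** 2).sum())   # positive unless the fit is perfect
```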

~~~
pliny
If you only needed to do that, you could just take the absolute error (even
the 0-1 loss function, where every point that the regression hyperplane
doesn't pass through contributes 1 to the error, fulfills this criterion).

~~~
eat_veggies
I'm super new to statistics and math, so can you fill me in on why the error
is squared rather than absolute valued? Is it because it's easier to take the
derivative of, and therefore minimize analytically?

------
tw1010
It's way more fun to know how to derive least squares than to memorize some
formula:
[https://see.stanford.edu/materials/lsoeldsee263/05-ls.pdf](https://see.stanford.edu/materials/lsoeldsee263/05-ls.pdf)
(page 4)

~~~
xadhominemx
Also useful to understand least squares as a special case of maximum
likelihood estimation. I think MLE is very intuitive.
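To sketch that connection (standard derivation; the notation is mine): assume $y_i = \beta^\top x_i + \varepsilon_i$ with i.i.d. Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood is

    \log L(\beta) = -\frac{n}{2}\log(2\pi\sigma^2)
                    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \beta^\top x_i\bigr)^2

and only the second term depends on $\beta$, so maximizing the likelihood is exactly minimizing the sum of squared residuals: Gaussian MLE is least squares.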

~~~
vecter
I know a bit of stats, can you explain why MLE is intuitive?

------
vecter
This is very dangerous and an awful way to compute the least squares fit due
to potential numerical issues with calculating the inverse of the matrix. I
wish he would put a warning in a huge bold header to _never do this for actual
production work_.
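For reference, a minimal sketch of the safer route in Python/NumPy (random toy data; the point is to let the library solve the least-squares problem via SVD instead of forming and inverting X'X yourself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# fragile: explicit normal equations, inverting X'X
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# preferred: SVD-based solver, stable even for ill-conditioned X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_lstsq)
```

On well-conditioned data like this the two agree; on nearly collinear columns the explicit inverse can be badly wrong while `lstsq` stays well-behaved.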

~~~
ak_yo
This is right -- plus lm() is faster! Although, from a statistical
perspective, if you can't invert X'X, that should first make you think "I have
data quality issues" (i.e. multicollinearity) rather than "I need a different
algorithm to compute the inverse".

------
wgyn
If you really want to do linear regression by hand, check out Chapter 1 of
Stephen Stigler's History of Statistics:
[https://www.amazon.com/History-Statistics-Measurement-Uncertainty-before/dp/067440341X](https://www.amazon.com/History-Statistics-Measurement-Uncertainty-before/dp/067440341X).
You can do least squares on astronomical data the way Legendre did it.

------
antirez
Very small fully connected neural networks are incredibly good at
approximating functions even after one second of training with RPROP, and of
course that holds for complex non-linear functions as well.

Btw, doing linear regression with pencil and paper, by geometrically tracing a
line that appears to fit the points and then calculating the coefficients, is
trivial.

------
inlineint
I thought it was about doing a scatter plot on graph paper and trying to draw
a line with a ruler so that “almost all points fit”, then empirically
measuring the slope and the intercept. I had the impression that this was the
way to go in cases when the requirements for accuracy were not strict and
calculators were not around.

------
blt
No mention of gradient based solutions for huge data sets?
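For what it's worth, a minimal full-batch gradient-descent sketch for least squares (toy data and step size are mine; for genuinely huge data you'd swap in mini-batches, i.e. SGD):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5     # noise-free line, intercept 0.5

# append an intercept column of ones
Xb = np.hstack([X, np.ones((len(X), 1))])

beta = np.zeros(3)
lr = 0.1
for _ in range(500):
    # gradient of mean squared error: (2/n) X'(X beta - y)
    grad = 2 / len(y) * Xb.T @ (Xb @ beta - y)
    beta -= lr * grad

print(beta)   # converges to [2.0, -1.0, 0.5]
```

Each pass touches the data once with a couple of matrix products, so memory and time stay linear in the number of rows.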

