Hacker News
Linear regression by hand (dsgazette.com)
129 points by tarheeljason 10 months ago | 34 comments

MySQL is perfectly capable of calculating a linear regression for you, btw. In my case, I needed to be able to estimate trends from sparse time series data. Here's how you do that:

    SELECT
        @a_count := avg(`count`) as mean_count,
        @a_weeks := avg(`week`) as mean_weeks,
        @covariance := (sum(`week` * `count`) - sum(`week`) * sum(`count`) / count(`week`)) / count(`week`) as covariance,
        @stddev_count := stddev(`count`) as stddev_count,
        @stddev_week := stddev(`week`) as stddev_week,
        @r := @covariance / (@stddev_count * @stddev_week) as r,
        @slope := @r * @stddev_count / @stddev_week as slope,
        @y_int := @a_count - (@slope * @a_weeks) as y_int,
        @this_week_no := timestampdiff(WEEK, (select min(`date`) from dataset), curdate()) as this_week_no,
        @predicted := round(greatest(1, @y_int + (@slope * @this_week_no))) as predicted
    FROM (SELECT timestampdiff(WEEK, (select min(`date`) from dataset), `date`) as `week`, count(`date`) as `count` FROM dataset GROUP BY WEEK(`date`)) series;
I had to figure out how to translate the math into SQL; now you don't have to.
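For anyone who wants to sanity-check the SQL, here's the same math in plain Python (standard library only). The data are made up for illustration; the formulas mirror the query: population covariance, Pearson's r, then slope = r * (sd_y / sd_x) and intercept = mean_y - slope * mean_x.

```python
import statistics

# Toy stand-ins for the `week` and `count` columns of the hypothetical dataset.
weeks = [0, 1, 2, 3, 4]
counts = [10, 12, 15, 15, 18]

n = len(weeks)
mean_x = statistics.mean(weeks)
mean_y = statistics.mean(counts)

# Population covariance, matching SQL's (sum(xy) - sum(x)*sum(y)/n) / n
cov = (sum(x * y for x, y in zip(weeks, counts))
       - sum(weeks) * sum(counts) / n) / n
sd_x = statistics.pstdev(weeks)   # MySQL's stddev() is the population stddev
sd_y = statistics.pstdev(counts)

r = cov / (sd_x * sd_y)
slope = r * sd_y / sd_x
y_int = mean_y - slope * mean_x

# Predict week 5, clamped at 1 like the SQL's greatest(1, ...)
predicted = max(1, round(y_int + slope * 5))
print(slope, y_int, predicted)
```

On this toy series the slope works out to 1.9 and the intercept to 10.2, which you can verify against any stats package.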

This performs well enough to be able to crunch tens of millions of rows of data in "reasonable time" on a wimpy VPS.

Yes, but why? Right tool for the right job and all that...

Normally the reason you write anything beyond trivial SQL is because you only have a small amount of code to run and lots of data to run it over. Pushing the code to the data is more efficient than pulling the data to the code.

The latter might be conceptually cleaner (though it's debatable, relational is a fairly nice programming model and a lot more consistent and well-founded than object orientation, for one), but it's seldom optimal.

Three orders of magnitude or more speedups are not unexpected by pushing the code to the data.

You're basically spot on here. I have a bunch of rows that need to have trend data crunched and updated pretty frequently. Putting this into MySQL cut the server load quite a bit, and it isn't so much business logic that I feel bad about doing it.

I would guess access to the data is a decent reason to try. No need to pump the data elsewhere, if this is available.

I share your doubt that this is worth it, to be clear.

In this case the amount of data and the frequency with which it needs to be updated made handling this in MySQL more practical. I otherwise would've had to have some process querying and updating the records outside of MySQL and it turns out that that puts a whole lot more load on the server than if I just ask MySQL to do it.

And besides, what happened to the hacker ethos? "Because I can" should be justification enough. :-)

It is cool to hear that it works! And just because one doubts something doesn't mean one shouldn't try it. Especially if I'm the one doubting. :)

Can someone explain to me why this has (so many) upvotes? This is like elementary undergraduate econ stats and kind of trivial?

There's very little content either, it's literally a reformulation of the formula, no interesting graphs or geometric interpretation. What I expected from a title like "Linear Regression By Hand" was the minimization of some quadratic error function, by hand (i.e. using pencil and paper).

My suspicion, for the Eeyores round here, is that this has upvotes because people like to bookmark references to the maths behind what most of us do automagically nowadays, for when we might need it. And often it's the comments that provide even better and more advanced sources, as is the case here. And indeed there are some programmers new to this kind of thing.

I think a lot of people here didn't take econometrics courses.

If you see all regression problems through the lens of maximum likelihood estimation, you might not know that ordinary least squares regression has a closed-form solution.
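The closed form being referred to is the normal equations, beta = (X'X)^(-1) X'y. A minimal numpy sketch, with made-up exact data so the answer is obvious:

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Design matrix: a column of ones for the intercept, then x.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X'X beta = X'y (no explicit matrix inverse).
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [1. 2.]
```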

One issue that I nearly always find missing in intro discussions of linear regression is the near-universal assumption of no error in the abscissa/"x" values. And while this is true-ish for time series data (we know for certain which day we collected the data on, but was it the same hour every day?), I'd be rich if I had a nickel for every time I saw standard linear regression done when the "x" had significant (and known) error. In that case you're biasing yourself unless you use some sort of 2d regression, like Deming.[1]

[1] https://en.wikipedia.org/wiki/Deming_regression
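For the curious, here's a sketch of the Deming fit from that Wikipedia page, assuming equal error variances in x and y (delta = 1, i.e. orthogonal regression). The data are invented, scattered loosely around y = 2x:

```python
import math
import statistics

# Made-up data with noise in both coordinates, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Sample (co)variances.
sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Deming slope with delta = ratio of error variances (1 here).
delta = 1.0
slope = (syy - delta * sxx
         + math.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) \
        / (2 * sxy)
intercept = my - slope * mx
print(slope, intercept)
```

On this data the Deming slope comes out slightly above the OLS slope, which is the expected direction: ignoring the x-error attenuates the OLS estimate toward zero.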

Regression with measurement error is usually treated in much higher level statistics/econometrics classes.

If you're interested in this you can read more in Mostly Harmless Econometrics [1] about addressing this with IV methods.

[1] http://www.development.wne.uw.edu.pl/uploads/Main/recrut_eco...

To build on this a little bit more, there are also generalized linear models that allow you to specify the reliability (i.e., error level) of a variable.

Regarding 2SLS models, I find them more useful for accounting for endogeneity in the model than for measurement error. After all, measurement error is usually unobserved (otherwise you would just take it out). 2SLS “just” reweights the point estimates by identifying the good variation in the instrumented variable using the instrument (for example, using draft lottery results to account for the endogenous choice to attend college).

I agree, it seems like no help just to list a formula to memorize. If someone knows enough linear algebra to understand what the formula represents, they can do the derivation. This link [1] is a good one if anyone is interested.

[1] https://eli.thegreenplace.net/2014/derivation-of-the-normal-...

What I expected "by hand" to mean was something like a handmade analog computer. E.g. print your scatterplot, tape a penny over each data point, push a tack through the origin and let the printout swing around it under gravity until it comes to rest (while keeping it flat -- I guess I should've first stuck the printout onto some cardboard). Is there some generalization of this idea that lets the intercept vary too?

Well, unless you're doing original research, everything you write will be trivial to somebody.

There aren’t that many good articles on basic statistics that are freely available on the net. Textbooks are expensive, and most of the free material is of dubious quality. Wikipedia is particularly terrible on Statistics.

It seems like blog posts on simple statistical methods like this one land on the front page of HNews a lot more than one would expect.


I almost certainly know something I'd consider "trivial" that you haven't encountered yet. I try to be really excited when that happens.

I think that perhaps the issue is that machine learning courses skip over the fact that there is a trivial closed form solution to 99.9% of all real-world machine learning problems.

I remember being blown away as an undergrad that least squares (which I had learned first algebraically) had such an obvious geometric meaning:


You need to square the values so that positive and negative differences (between the points and the regression line) don't cancel out.

If that were all you needed, you could just take the absolute error. (Even the 0-1 loss function, where every point the regression hyperplane doesn't pass through contributes 1 to the error, fulfills this criterion.)

I'm super new to statistics and math, so can you fill me in on why the error is squared rather than absolute valued? Is it because it's easier to take the derivative of, and therefore minimize analytically?
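Differentiability is part of the answer, but the two losses also pick different estimators: squared error is minimized by the mean, absolute error by the median (which is why least absolute deviations is more robust to outliers). A brute-force check on made-up data with one outlier:

```python
# For a constant prediction c, minimizing sum((d - c)^2) gives the mean,
# while minimizing sum(|d - c|) gives the median. Grid search over c.
data = [1.0, 2.0, 3.0, 4.0, 100.0]   # one big outlier

candidates = [c / 10 for c in range(0, 1100)]   # 0.0 to 109.9 in 0.1 steps
best_sq = min(candidates, key=lambda c: sum((d - c) ** 2 for d in data))
best_abs = min(candidates, key=lambda c: sum(abs(d - c) for d in data))

print(best_sq)    # 22.0, the mean (dragged up by the outlier)
print(best_abs)   # 3.0, the median (robust to the outlier)
```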

It's way more fun to know how to derive least squares than to memorize some formula: https://see.stanford.edu/materials/lsoeldsee263/05-ls.pdf (page 4)

Also useful to understand least squares as a special case of maximum likelihood estimation. MLE I think is very intuitive.
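Concretely: if you assume y = b*x + Gaussian noise with known variance, the log-likelihood is a constant minus the sum of squared residuals over 2*sigma^2, so maximizing likelihood and minimizing squared error pick the same b. A toy numerical check (no intercept, invented data, grid search standing in for a proper optimizer):

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]   # roughly y = 2x

def neg_log_lik(b, sigma=1.0):
    # Gaussian negative log-likelihood: SSE/(2 sigma^2) plus a b-independent constant.
    return (sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / (2 * sigma ** 2)
            + len(x) * math.log(sigma * math.sqrt(2 * math.pi)))

def sse(b):
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

candidates = [b / 1000 for b in range(1500, 2500)]   # slopes 1.500 .. 2.499
b_mle = min(candidates, key=neg_log_lik)
b_ols = min(candidates, key=sse)
print(b_mle, b_ols)   # the two argmins coincide
```

The two objectives differ only by a monotone transform, so the minimizer is identical; that equivalence is the whole "least squares as MLE" story.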

I know a bit of stats, can you explain why MLE is intuitive?


You're confusing linear regression and least squares. (They're connected but not identical in that way.) Least squares gives the closest orthogonal (perpendicular) projection onto the range of the matrix. The slide is correct.

This is very dangerous and an awful way to compute the least squares fit due to potential numerical issues with calculating the inverse of the matrix. I wish he would put a warning in a huge bold header to never do this for actual production work.

This is right -- plus lm() is faster! Although, from a statistical perspective, if you can't invert X'X, that should first make you think "I have data quality issues" (i.e. multicollinearity) rather than "I need a different algorithm to compute the inverse".
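To make the point concrete, here's the textbook formula next to a solver-based fit in numpy. On well-conditioned invented data they agree; on nearly collinear columns the explicit inverse is the one that blows up first.

```python
import numpy as np

# Synthetic data: y = 3 + 0.5x + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 0.5 * x + rng.normal(0, 0.1, size=50)

X = np.column_stack([np.ones_like(x), x])

# The "textbook" way: explicitly invert X'X. Avoid this in production.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# The numerically sane way: an SVD-based least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))   # True on this well-behaved example
```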

If you really want to do linear regression by hand, check out Chapter 1 of Stephen Stigler's History of Statistics: https://www.amazon.com/History-Statistics-Measurement-Uncert.... You can do least squares on astronomical data the way Legendre did it.

Very small fully connected neural networks are incredibly good at approximating functions even after one second of training with RPROP. Of course this works for complex non-linear functions as well.

Btw, doing linear regression with pencil and paper by just geometrically tracing a line that appears to fit the points and then calculating the coefficients is trivial.

I thought it was about doing a scatter plot on graph paper and trying to draw a line with a ruler so that "almost all points fit", then empirically measuring the slope and the intercept. I had the impression that this was the way to go when the requirements for accuracy were not strict and calculators were not around.

No mention of gradient based solutions for huge data sets?
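For completeness, a minimal stochastic gradient sketch in plain Python, the kind of thing you'd reach for when the data won't fit in memory. The data here are synthetic and noiseless (y = 1 + 2x), so the parameters should converge to (1, 2); learning rate and epoch count are arbitrary choices for this toy.

```python
import random

random.seed(42)
# 1000 points on y = 1 + 2x with x in [0, 1). In real use these would be
# streamed from disk rather than held in a list.
data = [(i / 1000, 1.0 + 2.0 * (i / 1000)) for i in range(1000)]

a, b = 0.0, 0.0          # intercept, slope
lr = 0.01                # learning rate
for _ in range(50):      # epochs
    random.shuffle(data)
    for x, y in data:
        err = (a + b * x) - y
        a -= lr * err        # gradient of 0.5*err^2 w.r.t. a
        b -= lr * err * x    # gradient of 0.5*err^2 w.r.t. b
print(a, b)   # approaches (1.0, 2.0)
```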
