Linear regression
98 points by ingve on Aug 9, 2016 | 37 comments

If you're new to learning about regression and want to use it, I would prioritize the statistical issues. There are lots of computational details you could read about, like in this post. However, in simple cases, fitting least-squares linear regression is one line of code:

```
w_fit = X \ yy; % Matlab/Octave, X is NxD, yy is Nx1
w_fit = np.linalg.lstsq(X, yy, rcond=None)[0] # NumPy equivalent
```

Knowing how to construct the "design matrix" X is the important bit. What are your features, how will they be pre-processed, and given what you've done, should you trust w_fit to be meaningful, or just useful for prediction?

If you've thrown in lots of features, hoping to get good predictions, you may want to do "regularization". The computation is no harder for simple versions. One way to do "L2 regularization" is to add some extra rows to X and yy before fitting the weights:

```
% Matlab/Octave... the NumPy is nearly as simple.
[N, D] = size(X);
lambda = 1; % regularization constant
X = [X; eye(D)*sqrt(lambda)]; % D extra rows that penalize large weights
yy = [yy; zeros(D,1)]; % matching zero targets; then fit with X \ yy as before
```

Or you could use an actual stats package, like R's libraries, which will have had a lot more care put into them. Again, you need to understand the statistics, for example how you should select lambda. Then you'll want to know how to diagnose your model and whether you should trust it.

It's fun to work out the least squares maths and implement these things from scratch. But it has been done, and it's not what matters for most applications.

EDIT: a discussion with good links on linear models and beyond from the other day: https://news.ycombinator.com/item?id=12237998

To add, here's a good standard text to help you learn to "diagnose your model" for bias and classical assumption violations, and how to fix them.

> Then you'll want to know how to diagnose your model and whether you should trust it.

Yes, this x1000.
Normally I rely on F values to get an indication of how reliable the thing is. The package I use spits out a tonne of other information (Cook's D etc.), but most of the output is just noise to me: at university all I was taught was the t-test (I'm an engineer), and the (now retired) statistician at my work only gave me a very cursory explanation and basically emphasized the F value.

Econometrician here.

I really hate that linear regression is being explained in "Machine Learning" language. "Features", "design matrix", "predictions".

And many in the sciences are confused by 'endogenous' and 'exogenous', and just say 'independent' and 'dependent'... Every field has its jargon and econometrics has no primacy here; there's no reason to have a strong emotional reaction to words you understand just because they're not what you learned in college.

> If you're new to learning about regression and want to use it, I would prioritize the statistical issues.

There is absolutely no need to take the statistical route in least squares regression. In fact, it needlessly complicates things. Interpreting least squares regression as the minimization of the sum of squares of the residuals is a straightforward way to understand the technique.

I'm not sure what you mean, but I think we're talking past each other.

Regression is a statistical term. If it's going to be used for an application involving data and real decisions, not understanding the statistical issues is irresponsible. That's not at odds with viewing it as a procedure to minimize squared errors.

To clarify: there are probabilistic models (involving Gaussians and stuff) where some interpretations suggest least-squares estimation. That wasn't the distinction I was trying to make. Statisticians who hate probabilistic models will still say that it's important to think about what you put in X and y before you do X\y, and to be careful about what you can conclude from the result.
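The two views in this exchange can be checked against each other numerically. Below is a minimal sketch, assuming synthetic data and hypothetical names (`sum_sq_residuals`, `w_normal`, `perturbed` are illustrative, not from the thread), showing that the weights returned by the one-line solver are the ones that minimize the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): N points, D features.
N, D = 50, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
yy = X @ w_true + 0.1 * rng.normal(size=N)


def sum_sq_residuals(w):
    """Sum of squared residuals of the linear fit with weights w."""
    r = yy - X @ w
    return float(r @ r)


# Library solver, the NumPy counterpart of X \ yy.
w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]

# The same weights from the normal equations X'X w = X'y.
w_normal = np.linalg.solve(X.T @ X, X.T @ yy)

# Randomly perturbing the fitted weights never lowers the objective.
perturbed = [
    sum_sq_residuals(w_fit + 0.01 * rng.normal(size=D)) for _ in range(100)
]

print(np.allclose(w_fit, w_normal))
print(min(perturbed) >= sum_sq_residuals(w_fit))
```

The sketch only confirms the optimization claim. Whether `w_fit` is meaningful for a given X and yy is the separate question the comment above is raising.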
Statisticians who hate probabilistic models might want to find another line of work. Like, say, philosophy.

You use the phrase "involving Gaussians and stuff" to "clarify," and you ask others for rigor? Physician, heal thyself!

And please don't speak for statisticians; they are a peculiar bunch. Ask the same question of 2 statisticians and you may well end up with 3 answers.

> Regression is a statistical term. If it's going to be used for an application involving data and real decisions, not understanding the statistical issues is irresponsible.

No, it isn't.

It's curve fitting. The goal is to fit a curve to the data.

In any engineering application, no one cares what the statistical interpretation is. They only want to fit a curve to the data, and do it in an optimal way.

Hence, you take the sum of squares of the differences between the curve and the sample points (i.e., the residuals) and determine the parameters that minimize it.

That's it.

That's a particularly incompetent and irresponsible view of engineering.

The key to modeling anything correctly is to be aware of exactly what assumptions your choice of model corresponds to and how well they match the real-world processes underlying your data.

So you've fitted some curve, but why that curve? What are the implications of assuming linearity between your input and output domains? What have you assumed about the distribution of noise over your predictions? How about over your input features? How would you expect the bias and variance of your predictions to degrade if any of these assumptions no longer held?

As an engineer, if you don't understand the statistics behind your models well enough to answer such questions, their real-world applications aren't going to go far beyond wishful garbage-in-garbage-out number crunching.
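One of the questions above, what happens when the linearity assumption does not hold, can be illustrated with a small sketch. It assumes synthetic data whose true relation is quadratic; every name in it (`rmse_train`, `ends_mean`, `middle_mean`, `error_extrap`) is hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): the true relation is quadratic.
x = np.linspace(0.0, 1.0, 40)
y = x**2 + 0.01 * rng.normal(size=x.size)

# Fit a straight line y ~ w0 + w1 * x by least squares.
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ w

# Inside the training range the overall error looks small.
rmse_train = float(np.sqrt(np.mean(residuals**2)))

# But the residuals are systematically curved rather than noise-like:
# positive at both ends of the range and negative in the middle.
ends_mean = float(np.mean(np.r_[residuals[:5], residuals[-5:]]))
middle_mean = float(np.mean(residuals[15:25]))

# Evaluating the line outside the training range is much worse.
x_new = 3.0
error_extrap = abs((w[0] + w[1] * x_new) - x_new**2)

print(rmse_train, ends_mean, middle_mean, error_extrap)
```

The least-squares line is still the optimal line for these points. The structured residuals and the extrapolation error are what the "why that curve?" question is about.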
> So you've fitted some curve, but why that curve?

Because it's the curve that's obtained by that particular minimization criterion, which is the minimization of the L² norm.

If some other criterion were used, or some other approximation function, then the result would also be valid.

It appears that you are unaware that essentially all engineering in general, and the whole field of computational mechanics in particular, is founded on what can be described as curve fitting. Whether it's plain old least squares approximation (in particular, moving least squares) or other techniques focused on the minimization of some other norm (Galerkin-type methods, for instance), the basis is all the same.

> What are the implications of assuming linearity between your input and output domains?

In short, because analytic functions and Taylor's theorem exist.

Is it that hard?

> As an engineer, if you don't understand the statistics behind your models enough to answer to such questions, their real-world applications aren't going to go far beyond wishful garbage-in-garbage-out number crunching.

You know nothing about engineering, and somehow you're assuming that everything can only be valid if it's interpreted as a statistical problem.

This isn't true, and it ignores entire fields of knowledge in physics and mathematics.

I agree: if you have data X and outputs y and must have a linear function of X to minimize squared error on y and only y, then yes, we have defined our task.

I think it's a lucky engineer that has their end goal defined as such a crisp mathematical task. Defining what is an optimal way to do a fit, for the surrounding application, usually requires thought. Often the reasonableness of the curve does matter, for example if it will be evaluated in locations other than the training data. Also, there may be some choice in how to set up the problem. For example, maybe we're allowed to transform the inputs with some fixed non-linear functions to create extra features.
Adding these extra features to the design matrix will improve fit, so an engineer might consider it, but they'll need to be careful about over-fitting if they care about a surrounding application.

The single most important fact about linear regression is the Frisch-Waugh-Lovell theorem. It's the whole reason why we're not scared as shit of using linear models.

https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93L...

Basically, it means that omitting variables is ok (subject to terms and conditions).

Assuming that IQ and shoe size are uncorrelated, but that income is a function of IQ and shoe size, I can run a regression:

income = a + b * IQ + residuals

and trust my estimate of b. The FWL theorem says that running that regression and then

residuals = c + d * shoe size

is the same as running

income = (a+c) + b * IQ + d * shoe size + eps

It's an amazing result and it's all based on linear algebra, no belief in statistics required.

Eh, orthogonalizing variables is a necessary precondition for this. When was the last time variables were totally uncorrelated and you didn't know it beforehand? Unfortunately such cases are the exception rather than the rule, and you lose interpretability by projecting into a low-rank orthogonal subspace. Not that interpretability is necessarily the be-all and end-all, but the "terms and conditions" are pretty important here!

> When was the last time variables were totally uncorrelated and you didn't know it beforehand

Whenever we have an actual theory of the underlying phenomena.

It's common in econometrics to impose orthogonality conditions on the theoretical equations so structural parameters can be recovered, or so we can filter partial correlations. For example: a regression of prices and quantities sold will mix the effect of shifts in demand and in supply.
Solution: find something that is correlated with demand quantities but not with price directly (for example, changes in consumers' wealth).

A realistic, much cooler example:

> no belief in statistics required

Yes, but now you have a statistical assumption: that your structural model holds.

To make use of the Frisch-Waugh-Lovell theorem, you must test that your orthogonality assumption holds using statistics :)

There is one conceptually simple issue often missing in such nicely presented write-ups (and which appears to be missing here): error in the abscissa ('x-axis') values. Things like time series tend to dominate such analytics, and in such data collections it's typically assumed that the time-stamp data is accurate enough that any error there can be neglected. But there are many other data sources where there is notable error in both the 'x' and 'y' data, which commonly employed linear regression doesn't allow for (quick example: my friend the hydrologist collects flow rates in rivers at sample transverse distances, which are hard to be sure of when one is dangling above the raging waters). As a respectable starting point for regression which allows for error in both axes I'd recommend Deming regression: https://en.wikipedia.org/wiki/Deming_regression

Those cases are also handled by least squares regression, mainly variants of total least squares regression.

One special case of this general effect is non-linear coding error, especially in cases when it ends up being the level of a covariate (i.e.
the log of the covariate) that matters for causal inference, or when the covariates are categorical or isotonic.

The paper "Let's Put Garbage Can Regressions and Garbage Can Probits Where They Belong" by Achen is a great discussion of some particular properties of this, and of the tacit assumptions used to ignore it.

In that paper, it's demonstrated that with just a tiny bit of coding error in the covariates, you can end up with a fitted regression coefficient that is statistically significant and has the wrong sign -- even when there is no noise whatsoever in the target variable (i.e. you can set up a toy example in which the target variable is synthetically generated as a true linear function of two covariates with positive coefficients, then perform a slight non-linear distortion on one of the covariates, regress the synthetic target variable on the clean covariate and the distorted covariate, and get wildly incorrect coefficients that appear to be statistically significant).

People seem to think these toy examples are some kind of alien phenomenon that could never happen with real-world data, but the paper is very explicit in the construction of the example data set. It's not harebrained or contrived, like Anscombe's Quartet or anything -- it's very much a plausible data set.

I think it's not hyperbolic at all to say that results like this more or less conclusively show that naive linear regression cannot be trusted. If you're careful with model validation, using randomized hold-out data, lots of diagnostic plotting and sanity checking, then regression is a fine tool.
But if you do something shocking like take two different univariate models with the same target, fit their regression coefficients, and then select the model with the more favorable t-stat as "the winner", then you are committing an egregious statistical fallacy that often, in real-world situations, gives you not just an inaccurate answer, but an answer pointing in totally the opposite direction from the truth.

What's frightening to me is that across many industries, even in places like high finance -- where "real money is on the line" -- it is extremely common to see huge business intelligence systems predicated entirely on this type of fallacious statistical approach to regression. Sadly, it's often because the regression approach was historically more tractable and the fallacies weren't as well known. And so as certain people gained more senior positions and sought to retain political control of the business tools that they oversaw, they grasped for convenient fictions like "interpretability" to justify their political choice to shun modern techniques.

Anecdotal thought: I find practical guides and tutorials like this far more useful for learning machine learning than any of the public MOOCs like Andrew Ng's machine learning class. There's a lot of dense content shown in video lectures and very little practice. I had to google almost every topic and find a write-up like this to drive each concept home.

Why are you not testing for independence of errors? http://people.duke.edu/~rnau/testing.htm

A common gotcha with linear regression: imagine that your data points look like they roughly fit the shape of a tilted ellipsis. The linear regression will not run along the axis of the ellipsis; the slope will typically be smaller. This is because you're minimizing the squared distance along the y axis, not the distance from a point to the line itself.

What you appear to want is "total least squares":

Do you mean "ellipse"?
I was a bit confused about the "ellipsis" comment, since an ellipsis looks like a collection of collinear points.

Why are you not testing for constant variance? https://onlinecourses.science.psu.edu/stat501/node/367

Because, to quote Box,

‘To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!’ (Box 1953)

Plot the residuals instead.

Ctrl-F "Gauss"
0 matches
Ctrl-F "Markov"
0 matches

The fuck? Only the single most beautiful result in the field. Please don't misunderstand me, the presentation is great and it is quite practical; I just wish people would remember that very smart individuals have been coming along for pretty much the entirety of humanity's existence.

https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem

The proof is hinted at in the writeup, so why not link to it?

I'm just not sure that it makes sense to add that to the article. The article focuses on the "how" of linear regression, and presents an interesting case for using non-analytic solutions, even though the analytic solution is easy. The Gauss-Markov theorem is more of a "why". I can imagine enjoying a discussion about why we use linear regression, or why regularization is important, or the "why" of many things, but those are simply different articles.

The fact that an article doesn't mention a beautiful result in the field is hardly a valid criticism of the article.

I don't understand why they downvote you. Probably because they think that they know statistics. Programmers at Hacker News always think that they are the masterminds of statistics. In fact they know almost nothing about statistics. For example, if you don't test for constant variance and independence of errors then you cannot graduate from a Department of Statistics. But here they downvote you for knowing real statistics.
> But here they down vote you for knowing real Statistics.

That is quite literally a No True Scotsman argument.

GP was likely downvoted since the Ctrl-F meme adds nothing to the discussion.

Personally, I felt that the omission was rather important. When the first year's worth of many statistics programs can be condensed into one beautiful theorem, which is centuries old, I feel it's disrespectful to one's forebears to ignore it.

However, I get the impression that once the meme sensor goes off, any links remain unvisited and reading comprehension may suffer. So I learned something here, too.

I'd edit the original to remove the meme, but since I can't do that at this point, readers will have to try and control their Internet Police tendencies and read a few sentences.

Or not. Rather a shame, hopefully the author of the piece will fix it and this will all be superfluous.

[post author here]

FWIW I did not find your original comment particularly offensive, and didn't downvote (I think years on the internet have made me resilient).

The Gauss-Markov theorem is very cool. However, as I discovered while doing some reading for this article, whole books were written on linear regression. The amount of information to convey is almost limitless and I didn't want to turn this one into a multi-part article. I simply wanted to write down (code and math) the things I find most pertinent to understanding how the regression works.

Fair enough! If you felt like adding it as a footnote, great, if it doesn't fit the purpose, that's fine too. And to reiterate, it's a beautifully laid out, concise, high-content piece of writing.
I should have phrased it differently; how it's said may matter more than what's said.

Essentially I wrote a reaction post but, given the amount of revising and reanalysis/examination-of-assumptions I've been doing lately, I ought to have said something like:

"This piece is an excellent tutorial on how to do linear regression, why gradient descent is fundamental (far beyond OLS or fixed effects models), and so forth. However, if you are interested in why all of these things are possible in the multivariate case as well as the simple bivariate case, you might find the Gauss-Markov theorem to be a 'light bulb' moment." Because that's how it happened to me.

Cheers, and thanks for writing a great piece that I will be pointing others towards.

Thanks for the feedback and the useful reference. I'll definitely add the Gauss-Markov theorem to my to-look-in-depth-at-later list and may write up about it at some future point.

To be honest, the attitude doesn't help. Did the author omit something that you find important? Well, you can point it out saying something like "hey, nice informational article, but I would have included X, Y and Z." It's basically what you said, but in a more constructive manner.

Mm, probably best if the next time I have that sort of urge, I try writing it up and see if what I'm suggesting is even compatible with what the author sought to accomplish.

I momentarily forgot how difficult it can be to stay tightly focused on the parts that the author wants to emphasize. It's an important result, and hinted at directly by the derivation, but ultimately it's not my decision to make.

Please don't make such assumptions and generalizations about the HN community. It predictably derails the thread and distracts from substantive discussion.