w_fit = X \ yy; % Matlab/Octave, X is NxD, yy is Nx1
w_fit = np.linalg.lstsq(X, yy)[0] # NumPy equivalent (lstsq returns solution, residuals, rank, singular values)
If you've thrown in lots of features, hoping to get good predictions, you may want to do "regularization". The computation is no harder for simple versions. One way to do "L2 regularization" is to add some extra rows to X and yy before fitting the weights:
% Matlab/Octave... the NumPy is nearly as simple.
[N, D] = size(X);
lambda = 1; % regularization constant
X = [X; eye(D)*sqrt(lambda)];
yy = [yy; zeros(D,1)];
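For what it's worth, the NumPy version of the same augmentation trick is nearly identical. A minimal sketch, assuming X and yy are the design matrix and target vector from the first snippet:
import numpy as np
lam = 1.0                                          # regularization constant
N, D = X.shape
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])   # D extra "virtual" rows
yy_aug = np.concatenate([yy.ravel(), np.zeros(D)]) # with zero targets
w_fit = np.linalg.lstsq(X_aug, yy_aug)[0]          # L2-regularized weights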
It's fun to work out the least squares maths and implement these things from scratch. But it has been done, and it's not what matters for most applications.
EDIT: a discussion with good links on linear models and beyond from the other day https://news.ycombinator.com/item?id=12237998
Yes, this x1000. Normally I rely on F-values to get an indication of how reliable the thing is. The package I use spits out a tonne of other information (Cook's D, etc.), but most of the output is just noise to me. At university all I was taught was the t-test (I'm an engineer), and the (now retired) statistician at my work only gave me a very cursory explanation; he basically emphasized the F-value.
I really hate that linear regression is being explained in "Machine Learning" language.
"Features", "design matrix", "prediction s".
There is absolutely no need to take the statistical route in least squares regression. In fact, it needlessly complicates things. Interpreting least squares regression as the minimization of the sum of squares of the residuals is a straightforward way to understand the technique.
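For the record, the minimization view in symbols (standard notation, nothing specific to this thread): $\hat w = \arg\min_w \lVert Xw - y \rVert_2^2$, and setting the gradient to zero gives the normal equations $X^\top X \hat w = X^\top y$. That is exactly the system that X \ yy or lstsq solves (internally via QR rather than by forming $X^\top X$, but the solution is the same for full-rank $X$).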
Regression is a statistical term. If it's going to be used for an application involving data and real decisions, not understanding the statistical issues is irresponsible. That's not at odds with viewing it as a procedure to minimize square errors.
To clarify: There are probabilistic models (involving Gaussians and stuff) where some interpretations suggest least-squares estimation. That wasn't the distinction I was trying to make. Statisticians who hate probabilistic models will still say that it's important to think about what you put in X and y before you do X\y, and to be careful about what you can conclude from the result.
You use the phrase "involving Gaussians and stuff" to "clarify," and you ask others for rigor? Physician, heal thyself! And please don't speak for statisticians, they are a peculiar bunch. Ask the same question of 2 statisticians and you may well end up with 3 answers.
No, it isn't.
It's curve fitting. The goal is to fit a curve to the data.
In any engineering application, no one cares what's the statistical interpretation. They only want to fit a curve to the data, and do it in an optimal way.
Hence, you form the sum of squares of the differences between the curve and the sample points (i.e., the residuals) and find the parameters that minimize it.
The key to modeling anything correctly is to be aware of exactly what assumptions your choice of model corresponds to and how well they match the real-world processes underlying your data.
So you've fitted some curve, but why that curve? What are the implications of assuming linearity between your input and output domains? What have you assumed about the distribution of noise over your predictions? How about over your input features? How would you expect the bias and variance of your predictions to degrade if any of these assumptions no longer held?
As an engineer, if you don't understand the statistics behind your models enough to answer to such questions, their real-world applications aren't going to go far beyond wishful garbage-in-garbage-out number crunching.
Because it's the curve obtained by that particular minimization criterion, namely minimization of the L² norm of the residual.
If some other criterion were used, or some other approximating function, then the result would also be valid.
It appears that you are unaware that essentially all engineering in general, and the whole field of computational mechanics in particular, is founded on what can be described as curve fitting. Whether it's plain old least squares approximation (in particular, moving least squares) or other techniques focused on the minimization of some other norm (Galerkin-type methods, for instance), the basis is all the same.
> What are the implications of assuming linearity between your input and output domains?
In short, because analytic functions and Taylor's theorem exist.
Is it that hard?
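(Spelled out: for a smooth response f, Taylor's theorem gives $f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0)$ near an operating point $x_0$, so a linear model in the inputs is the leading-order behaviour of almost any smooth relationship; the real assumption is that the data stay close enough to $x_0$ for the higher-order terms to be negligible.)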
> As an engineer, if you don't understand the statistics behind your models enough to answer to such questions, their real-world applications aren't going to go far beyond wishful garbage-in-garbage-out number crunching.
You know nothing about engineering, and somehow you're assuming that everything can only be valid if it's interpreted as a statistical problem.
This isn't true, and it ignores entire fields of knowledge in physics and mathematics.
I think it's a lucky engineer that has their end-goal defined as such a crisp mathematical task. Defining what is an optimal way to do a fit, for the surrounding application, usually requires thought. Often the reasonableness of the curve does matter, for example if it will be evaluated in locations other than the training data. Also, there may be some choice in how to set up the problem. For example, maybe we're allowed to transform the inputs with some fixed non-linear functions to create extra features. Adding these extra features to the design matrix will improve fit, so an engineer might consider it, but they'll need to be careful about over-fitting if they care about a surrounding application.
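To make that concrete, here is a small NumPy sketch of the kind of feature expansion I mean; the data and the polynomial basis are invented purely for illustration:
import numpy as np
# Invented 1-D data, just for illustration:
x = np.linspace(0, 1, 20)
yy = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(x.size)
# Fixed non-linear transforms of the input become extra columns of the
# design matrix. More columns always fit the training points better...
X = np.column_stack([x**k for k in range(10)])   # degree-9 polynomial basis
w_fit = np.linalg.lstsq(X, yy)[0]
# ...but predictions away from the training inputs can be wild, which is why
# held-out data (or regularization, as above) matters for the application.
x_new = np.linspace(-0.2, 1.2, 200)
X_new = np.column_stack([x_new**k for k in range(10)])
pred = X_new @ w_fit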
Basically, it means that omitting variables is ok (subject to terms and conditions).
Assuming that IQ and shoe size are uncorrelated, but that income is a function of IQ and shoe size, I can run a regression:
income = a + b * IQ + residuals
and trust my estimate of b. FWL theorem says that running that regression and then
residuals = c + d * shoe size
is the same as running
income = (a+c) + b * IQ + d * shoe size + eps
It's an amazing result and it's all based on linear algebra, no belief in statistics required.
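If you want to see it numerically, here is a quick NumPy check on synthetic data with made-up coefficients. It uses the general form of the theorem, where both income and shoe size are first residualized on [1, IQ]; that reduces to the two-step version above when the regressors are exactly orthogonal in the sample:
import numpy as np
rng = np.random.RandomState(0)
n = 1000
iq = rng.randn(n)
shoe = rng.randn(n)                                   # independent of IQ by construction
income = 3.0 + 2.0 * iq + 0.5 * shoe + rng.randn(n)   # made-up coefficients
ones = np.ones(n)
# Full regression: income on [1, IQ, shoe size]
b_full = np.linalg.lstsq(np.column_stack([ones, iq, shoe]), income)[0]
# FWL: partial [1, IQ] out of both income and shoe size, then regress residual on residual
X1 = np.column_stack([ones, iq])
r_income = income - X1 @ np.linalg.lstsq(X1, income)[0]
r_shoe = shoe - X1 @ np.linalg.lstsq(X1, shoe)[0]
d_fwl = np.linalg.lstsq(r_shoe[:, None], r_income)[0][0]
print(b_full[2], d_fwl)   # identical up to floating-point error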
Whenever we have an actual theory of the underlying phenomena.
It's common in econometrics to impose orthogonality conditions on the theoretical equations so structural parameters can be recovered, or so we can filter partial correlations. For example: a regression of prices and quantities sold will mix the effect of shifts in demand and in supply. Solution: find something that correlates with demand quantities but not with price directly (for example, changes in consumers' wealth).
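Mechanically, the price/quantity example works out like this as a minimal two-stage least squares sketch; the structural coefficients and the data are invented for illustration:
import numpy as np
rng = np.random.RandomState(1)
n = 10000
# Invented structural model:
#   demand: q = 10 - 1.0 * p + 0.5 * wealth + u_d
#   supply: q =  2 + 0.8 * p               + u_s
wealth = rng.randn(n)
u_d, u_s = rng.randn(n), rng.randn(n)
# Market clearing gives the observed price and quantity:
p = (10 - 2 + 0.5 * wealth + u_d - u_s) / (1.0 + 0.8)
q = 2 + 0.8 * p + u_s
ones = np.ones(n)
# Naive OLS of quantity on price mixes the two curves (p is correlated with u_s):
b_ols = np.linalg.lstsq(np.column_stack([ones, p]), q)[0]
# Wealth shifts demand but not supply, so it traces out the supply curve.
# Stage 1: predict price from the instrument.
Z = np.column_stack([ones, wealth])
p_hat = Z @ np.linalg.lstsq(Z, p)[0]
# Stage 2: regress quantity on the predicted price.
b_iv = np.linalg.lstsq(np.column_stack([ones, p_hat]), q)[0]
print(b_ols[1], b_iv[1])   # OLS slope is badly biased; the IV slope is close to 0.8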
A realistic, much cooler example:
Yes, but now you have a statistical assumption: that your structural model holds.
To make use of the Frisch-Waugh-Lovell theorem, you must test that your orthogonality assumption holds using statistics :)
The paper "Let's Put Garbage Can Regressions and Garbage Can Probits Where They Belong" by Achen  is a great discussion about some particular properties of this, and the tacit assumptions used to ignore it.
In that paper, it's demonstrated that with just a tiny bit of coding error in the covariates, you can end up with a fitted regression coefficient that is statistically significant and has the wrong sign -- even when there is no noise whatsoever in the target variable (i.e. you can set up a toy example in which the target variable is synthetically generated as a true linear function of two covariates with positive coefficients, then perform a slight non-linear distortion on one of the covariates, regress the synthetic target variable on the clean covariate and the distorted covariate, and get wildly incorrect coefficients that appear to be statistically significant).
People seem to think these toy examples are some kind of alien phenomenon that could never happen with real-world data, but the paper is very explicit in the construction of the example data set. It's not harebrained or contrived, like Anscombe's Quartet or anything -- it's very much a plausible data set.
I think it's not hyperbolic at all to say that results like this more or less conclusively show that naive linear regression cannot be trusted. If you're careful with model validation, using randomized hold out data, lots of diagnostic plotting and sanity checking, then regression is a fine tool. But if you do something shocking like take two different univariate models with the same target, fit their regression coefficients, and then select the model with a more favorable t-stat as "the winner" then you are committing an egregious statistical fallacy that often, in real world situations, is giving you not just an inaccurate answer, but an answer pointing totally in the opposite direction of the truth.
What's frightening to me is that across many industries, even in places like high finance -- where "real money is on the line" -- it is extremely common to see huge business intelligence systems predicated entirely on this type of fallacious statistical approach with regression. Sadly, it's often because the regression approach was historically more tractable and the fallacies weren't as well known. And so as certain people gained more senior positions and sought to retain political control of the business tools that they oversaw, they grasped for convenient fictions like "interpretability" to justify their political choice to shun modern techniques.
 < http://www.columbia.edu/~gjw10/achen04.pdf >
‘To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!’ (Box 1953)
Plot the residuals instead.
The fuck? Only the single most beautiful result in the field. Please don't misunderstand me, the presentation is great and it is quite practical, I just wish people would remember that very smart individuals have been coming along for pretty much the entirety of humanity's existence.
The proof is hinted at in the writeup, so why not link to it?
The fact that an article doesn't mention a beautiful result in the field is hardly a valid criticism of the article.
That is quite literally a No True Scotsman argument.
GP was likely downvoted since the Ctrl-F meme adds nothing to the discussion.
However, I get the impression that once the meme sensor goes off, any links remain unvisited and reading comprehension may suffer. So I learned something here, too.
I'd edit the original to remove the meme, but since I can't do that at this point, readers will have to try and control their Internet Police tendencies and read a few sentences.
Or not. Rather a shame, hopefully the author of the piece will fix it and this will all be superfluous.
FWIW I did not find your original comment particularly offensive, and didn't downvote (I think years on the internet have made me resilient).
The Gauss-Markov theorem is very cool. However, as I discovered while doing some reading for this article, whole books have been written on linear regression. The amount of information to convey is almost limitless, and I didn't want to turn this one into a multi-part article. I simply wanted to write down (code and math) the things I find most pertinent to understanding how the regression works.
Essentially I wrote a reaction post but, given the amount of revising and reanalysis/examination-of-assumptions I've been doing lately, I ought to have said something like:
"This piece is an excellent tutorial on how to do linear regression, why gradient descent is fundamental (far beyond OLS or fixed effects models), and so forth. However, if you are interested in why all of these things are possible in the multivariate case as well as the simple bivariate case, you might find the Gauss-Markov theorem to be a 'light bulb' moment." Because that's how it happened to me.
Cheers, and thanks for writing a great piece that I will be pointing others towards.
I momentarily forgot how difficult it can be to stay tightly focused on the parts that the author wants to emphasize. It's an important result, and hinted at directly by the derivation, but ultimately it's not my decision to make.