Isn’t John D Cook arguing for iterative methods there? It doesn’t read as a defense of using the normal equations or direct methods. In fact, the reference to sparsity suggests John may be referring to SGD or a similar iterative method.
Obviously, any least-squares solution you get satisfies the normal equations, so it’s unclear what the parent comment meant by saying that using the normal equations is superior to SGD.
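To make the point concrete: whatever algorithm produced the minimizer, the residual it leaves is orthogonal to the columns of A, which is exactly the normal equations. A quick numerical check with toy data (numpy, random A and b of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

# Any least-squares minimizer satisfies the normal equations A'(Ax - b) = 0,
# regardless of which algorithm produced it.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A.T @ (A @ x - b), 0))  # residual is orthogonal to col(A)
```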
Stochastic gradient descent only approximates the solution, and it may do that by looking at a single equation (or a small batch of them) at a time, without ever bothering with the whole system.
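A toy sketch of that "one equation at a time" idea (my own made-up consistent system, so an exact solution exists): each step looks at a single row of A and nudges x along that row's gradient, never touching the full system at once.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true  # consistent system: x_true solves it exactly

x = np.zeros(3)
lr = 0.01
for epoch in range(200):
    for i in rng.permutation(len(b)):
        a_i = A[i]
        # gradient of 0.5 * (a_i @ x - b[i])**2 w.r.t. x: one equation at a time
        x -= lr * (a_i @ x - b[i]) * a_i

print(np.linalg.norm(x - x_true))  # small: SGD has approached the exact solution
```

With a consistent system and a small step size this converges to the true solution; with noisy b, a fixed step size would only get you to within a noise ball around it.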
Is it approximating the gradient or approximating the solution?
If it’s the gradient, doesn’t it match the true gradient in expectation, so you can run as many iterations as needed to get epsilon precision, up to what your machine supports?
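The "true gradient in expectation" part is easy to check numerically on toy data of my own choosing: pick a row uniformly at random and scale its per-row gradient by the row count, and the expectation is exactly the full gradient of 0.5*||Ax - b||^2.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)
x = rng.standard_normal(4)

# Full gradient of 0.5 * ||Ax - b||^2 (sum over all rows)
full_grad = A.T @ (A @ x - b)

# Per-row stochastic gradients, scaled by m: averaging over the uniform
# choice of row recovers the full gradient exactly (unbiasedness).
m = len(b)
per_row = np.array([m * (A[i] @ x - b[i]) * A[i] for i in range(m)])
print(np.allclose(per_row.mean(axis=0), full_grad))  # prints True
```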
It is just "what you do". If it is a small problem the default is a QR decomposition of A. If you are worried about speed, do a Cholesky decomposition of A'A. If the problem is big (usually because A is sparse) then you do conjugate gradient (because fill-in will bite with a direct method). If it is really, really big (A can't fit in memory) then it isn't clear what the "thing to do" is. It is probably "sketching", but in ML/neural-network land everyone just does SGD, which you can think of as a Monte Carlo estimate of the gradient (A'(Ax - b) for a linear least-squares problem). Maybe sketching and SGD are equivalent (or one approximates the other). "What you do" is based on convergence and stability characteristics.
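The first three options above can be sketched in a few lines of numpy (toy dense data of my own choosing; for a genuinely sparse A you would use sparse matrices and never form A'A explicitly, only matrix-vector products as in the CG loop below):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# 1. Small problem, the default: QR of A, then back-solve R x = Q'b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# 2. Faster but less stable: Cholesky of the normal equations A'A x = A'b
#    (forming A'A squares the condition number of A).
L = np.linalg.cholesky(A.T @ A)
y = np.linalg.solve(L, A.T @ b)
x_chol = np.linalg.solve(L.T, y)

# 3. Big/sparse: conjugate gradient on the SPD system A'A x = A'b,
#    touching A only through matrix-vector products (minimal textbook CG).
def cg(matvec, rhs, iters=50, tol=1e-12):
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

x_cg = cg(lambda v: A.T @ (A @ v), A.T @ b)

print(np.allclose(x_qr, x_chol), np.allclose(x_qr, x_cg))
```

All three agree on this small well-conditioned example; the differences only bite at scale or with ill-conditioned A.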