
A line-by-line layman’s guide to Linear Regression using TensorFlow - rainboiboi
https://medium.com/@derekchia/a-line-by-line-laymans-guide-to-linear-regression-using-tensorflow-3c0392aa9e1f
======
gnulinux
Is there a reason to perform linear regression using an optimizer like SGD or
Adam as opposed to using least squares? For large matrices, is optimization
more scalable than solving the linear equation? Or is it that, since
optimizers are more expressive, it's a programmatic convenience/readability
thing?

~~~
cuchoi
"Least squares" means that the overall solution minimizes the sum of the
squares. In some cases you can get to that result analytically (I imagine that
this is what you are referring to as "least squares"). In other cases you
can't do it analytically. Also, sometimes it just a lot faster to use SGD or
other algorithms.
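
To make the objective concrete, here is a minimal NumPy sketch (the toy data
and names are illustrative, not from the article):

    import numpy as np

    # Toy data: X is an (n, d) design matrix, y an (n,) target vector.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

    def sum_of_squares(w):
        # The least-squares objective both approaches minimize: ||Xw - y||^2
        r = X @ w - y
        return r @ r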

~~~
gnulinux
Yes, I meant "linear least squares", which is "(X.T X)^{-1} X.T y".

> Also, sometimes it's just a lot faster to use SGD or other algorithms.

Right, that's what I thought. Is this because optimization is essentially
bounded by the number of epochs, while linear least squares is bounded by
matrix operations (so it scales with N)? Which means that if you can solve the
problem in a small number of epochs (say, 200) and N is very large, then SGD
will be faster. Is this correct?

EDIT: Obviously, I don't think _all_ regression problems can be solved this
way, but the loss function in this blog post can be minimized by linear least
squares. If you solve the optimization problem analytically, you'll get "(X.T
X)^{-1} X.T y".
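
For reference, a minimal NumPy sketch of that closed-form solution (the data
is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))            # (n, d) design matrix
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

    # Normal equations: w = (X.T X)^{-1} X.T y.
    # Solve the linear system instead of forming the inverse explicitly.
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # np.linalg.lstsq is the more robust route for ill-conditioned X.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)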

~~~
jing
I find it hard to believe that SGD would be faster than the closed-form
solutions for linear regression (gels, gelsd, etc.). The closed-form solutions
give a lot of other benefits in practical settings as well, which makes them
more likely to be used where possible. SGD and related optimizers give
benefits with non-convex or non-analytical loss functions, or with non-linear
layers / more than one layer.
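
Those are LAPACK driver names; a sketch of reaching them through SciPy's
scipy.linalg.lstsq (the data is illustrative):

    import numpy as np
    from scipy.linalg import lstsq

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

    # lapack_driver selects the LAPACK routine: 'gelsd' (SVD-based,
    # divide-and-conquer) or 'gelsy' (QR with column pivoting).
    w, residues, rank, sv = lstsq(X, y, lapack_driver='gelsd')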

~~~
gnulinux
Then why would anyone use TensorFlow with this loss function in practice? In
my school's ML class, we used this technique too (in addition to the closed-
form solution). Is there any practical reason to use an optimizer to solve a
linear problem?

~~~
jing
Note that it's not just the loss function. It's the loss function _combined_
with a very specific problem formulation - namely a neural network with only
linear activations (equivalent to a network with no hidden layers). Once you
go to non-linear layers or a different loss, it's no longer analytically
solvable.

I do see a lot of people writing tutorials like OP's. See for example:

https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931

The existence of these articles should not be taken as an indication of best
practice. They often have the goal of teaching SGD in a simplified setting,
_not_ teaching best practice for LLS. I suppose the only nice thing about
using TF / SGD for such a simple problem is that you now have a starting point
for solving more complex problems (ReLU activations, cross-entropy loss, more
layers, etc.).
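
For context, what these tutorials do boils down to roughly the following (a
minimal sketch assuming TensorFlow 2.x; the data and hyperparameters are
illustrative):

    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5)).astype(np.float32)
    y = (X @ rng.normal(size=5)).astype(np.float32)

    w = tf.Variable(tf.zeros([5]))
    b = tf.Variable(0.0)
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)

    # Full-batch gradient descent on mean squared error; true SGD would
    # sample a mini-batch of rows at each step instead.
    for _ in range(200):
        with tf.GradientTape() as tape:
            pred = tf.linalg.matvec(X, w) + b
            loss = tf.reduce_mean(tf.square(pred - y))
        grads = tape.gradient(loss, [w, b])
        opt.apply_gradients(zip(grads, [w, b]))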

A few other points as to why you would never use SGD for LLS:

1) it's always way slower than the closed-form matrix solutions

2) if you're doing SGD instead of just full-batch GD, there's noise in which
"rows" end up in a given batch - as a result, repeated runs may not converge
to exactly the same final weights. This never happens with the analytical
solution, which always gives exactly the same result.

3) if you're doing this as part of a data science pipeline, which is likely
the case in the real world, you'll likely want to do some cross-validation. In
the SGD case you have to recompute the entire solution for each fold, whereas
in the LLS case you can compute each fold's solution immediately once you've
calculated the initial X.T X / X.T y. This makes the _process_ of using LLS
even faster than SGD.
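
To illustrate point 3, a sketch (my own illustration, assuming a k-fold split)
of reusing the Gram matrix across folds: since X.T X and X.T y are sums over
rows, dropping a fold just subtracts its contribution:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

    XtX = X.T @ X      # computed once over the full data
    Xty = X.T @ y

    for fold in np.array_split(np.arange(len(y)), 5):
        Xf, yf = X[fold], y[fold]
        # Down-date the full-data sums by the held-out fold's contribution,
        # giving the training-set normal equations without recomputation.
        w = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)
        mse = np.mean((Xf @ w - yf) ** 2)   # evaluate on the held-out fold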

