
Machine Learning Notes - Linear Regression - tadasv
http://vilkeliskis.com/blog/2013/08/14/machine_learning_part_1_linear_regression.html
======
darkxanthos
I'm just wrapping up a full semester course on multiple regression and reading
this is a great and different perspective on it.

I definitely appreciate the simple approach in the article. If the OP is like
myself, perhaps he's posting this to better his understanding and leaving
artifacts for others to follow as they learn. I have to point out, there's so
much more happening in regression. To do it well, read further on it.

As a concrete example of why: the author mentions the R^2 value but doesn't
warn that adding more variables to your model will artificially inflate it.
For this reason, a better metric is the "Adjusted R^2", which penalizes extra
predictors. There's also testing the validity of your model, building it up
from scratch, understanding that you can't predict outside the domain of your
independent variables, etc.
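The adjusted-R^2 penalty described above can be sketched numerically. This is a hedged example, not from the article: it uses synthetic data and assumes NumPy, fitting the same response with and without an extra pure-noise predictor.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_pred, p):
    """Adjusted R^2 penalizes the number of predictors p (excluding intercept)."""
    n = len(y)
    r2 = r_squared(y, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Fit y on x, then on x plus a pure-noise column: plain R^2 can only go up
# when a nested model gains a column, while adjusted R^2 corrects for the
# extra degree of freedom.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.5, size=50)

for p in (1, 2):
    X = np.column_stack([np.ones(50), x, rng.normal(size=(50, p - 1))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_pred = X @ beta
    print(p, round(r_squared(y, y_pred), 4),
          round(adjusted_r_squared(y, y_pred, p), 4))
```

The noise column can never lower plain R^2, but adjusted R^2 discounts it, which is the point of the warning above.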

With that out of the way, I very much enjoyed seeing some of the math behind
this. My class was entirely focused on just learning to use a statistical
package to run regression. That's perfectly adequate, fine, and all I'll use
on a day to day basis. Understanding what's going on beneath the covers has
always just enabled me to be more powerful at the given task.

Thanks!

~~~
tadasv
Thanks for your comment. I was trying to keep everything as simple as possible,
so you could easily bootstrap your python project with code examples. You must
explore the topic further yourself! It's very easy to keep adding more and
more stuff to the post to the point where nobody will ever read it, or the
information just becomes incorrect. So yes, you need to drill down on your
own.

------
pyoung
Not to bag on the write up, as it was well done, but does linear regression
really qualify as machine learning? Almost every stats 101 course covers the
topic, and probably 99% of people who use linear regressions in their day-to-
day work would not call it machine learning. I know that linear regression is
sometimes presented in Machine Learning courses, but I always thought it was
done as a refresher, and not as actual course material of any significant
weight.

~~~
yummyfajitas
Why wouldn't it be machine learning? It certainly fits the definition of
supervised learning (aka regression).

Admittedly you do usually learn it in unsexy statistics classes rather than
sexy machine learning classes...

~~~
aet
The lines have always been a little blurry for me. You have statistics,
machine learning, data mining, artificial intelligence. It all seems to
overlap heavily. I tend to consider statistics and data mining to be concerned
mostly with classical statistics, and machine learning and artificial
intelligence to be concerned more with Bayesian methods and algorithms. That
being said, logistic regression seems like the quintessential
machine learning technique. So who knows, any experts care to comment?

~~~
christopheraden
As someone who did graduate work in Bayesian methods for a statistics master's
degree, I take offense at the suggestion that machine learning is not a concern
of statistics and data mining (but not really)! The hesitance towards Bayesian
methods seems more related to the discipline, and it seems that places that
call what they do "machine learning" tend to be less hostile towards the
explicit subjectivity of Bayes (I would highlight the word "explicit" in that
sentence--Frequentism has its fair share of subjectivity as well).

There was a great post on Stats.SE a few years ago about the difference
between statistics and machine learning[1]. Leo Breiman once argued that
statistics tends to focus more on model fitting and checking, while machine
learning looks at prediction accuracy. The exchange between Andy Gelman and
Brendan O'Connor is pretty funny. It has been my personal experience, however,
that many people who apply a method they brand as "machine learning" are not
as bothered with assumptions as my fellow conservative statisticians.

But statistics and machine learning are quite similar in foundation. Barring
the differences in terminology, as a professional statistician, I find I have
as little difficulty reading machine learning papers and algorithms as I do
reading statistics ones.

[1] http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning

------
usamec
Totally bad article. It encourages bad practices like checking validity on the
same set the model was trained on. You should do some cross-validation, or at
least split the data into two parts: train the model on the first part and
test it on the second.
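The hold-out split suggested above can be sketched with plain NumPy on synthetic data (a hedged example, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
y = 1.5 * x - 0.5 + rng.normal(scale=1.0, size=200)

# Shuffle the indices, then hold out 30% of the data for testing.
idx = rng.permutation(200)
train, test = idx[:140], idx[140:]

# Fit slope and intercept on the training split only.
X_train = np.column_stack([np.ones(train.size), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Evaluate mean squared error on the held-out split -- this is the honest
# estimate of generalization error, unlike in-sample error.
X_test = np.column_stack([np.ones(test.size), x[test]])
mse_test = np.mean((y[test] - X_test @ beta) ** 2)
print(beta, mse_test)
```

Proper k-fold cross-validation repeats this with rotating held-out folds and averages the test errors.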

------
textminer
Nice write up! I'd caution against just checking for an exactly-zero
determinant. Read up on ill-conditioned matrices, and maybe check the
condition number (or a determinant below a certain threshold) first. Also,
work hard to never, ever have to actually fully invert a matrix.
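A rough sketch of that advice (synthetic data, NumPy assumed): check the condition number of X^T X, and solve the least-squares problem with a solver rather than an explicit inverse.

```python
import numpy as np

# Two nearly collinear predictors give an ill-conditioned
# normal-equations matrix X^T X.
x1 = np.linspace(0, 1, 100)
x2 = x1 + 1e-8 * np.random.default_rng(1).normal(size=100)
X = np.column_stack([np.ones(100), x1, x2])

XtX = X.T @ X
cond = np.linalg.cond(XtX)
print(cond)  # huge: numerically near-singular even though det != 0

# Prefer a least-squares solver over explicit inversion; lstsq works
# via SVD and copes with (near-)rank-deficiency.
y = 3.0 * x1 + 1.0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

Solving via `lstsq` (or QR/Cholesky routines) is both faster and more stable than forming `inv(XtX) @ X.T @ y`.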

~~~
lp251
Checking that the determinant is below a threshold is not a valid test for
conditioning. Take epsilon*Identity, for example.
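A quick illustration of that counterexample (NumPy assumed): the determinant of eps*I can be made arbitrarily small while the condition number stays exactly 1.

```python
import numpy as np

eps = 1e-6
A = eps * np.eye(5)   # determinant is eps**5 = 1e-30, below any sane threshold
print(np.linalg.det(A))   # ~1e-30
print(np.linalg.cond(A))  # 1.0 -- all singular values equal, so perfectly conditioned
```

The determinant scales with the product of singular values, while conditioning depends on their ratio, which is why a determinant threshold conflates scale with near-singularity.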

------
urschrei
Not bad, but I'd rather have seen statsmodels[1] (which is more intuitive to
use, and gives you more data, as well as methods for displaying it) than
sklearn used for the library. I understand the choice given that it's "machine
learning", but as the comments are demonstrating, the distinction's not
actually that clear.

[1] http://statsmodels.sourceforge.net/stable/gettingstarted.html

