I'm really not amused by inflated phrases like "machine learning through polynomial regression". People have been doing polynomial regression for centuries (it is almost as old as linear regression), so how does it become a new thing under the umbrella of machine learning? Not to mention that the normal equation is vastly superior to gradient descent for solving for the coefficients in polynomial regression. (The criticism is directed not at the OP but at the field in general.)
As others have mentioned, I did not intend to inflate anything by using the term 'machine learning.' Polynomial regression is taught as a first concept in many machine learning courses, so I figured it might be a good way to introduce newcomers to the field. Many of the posts currently available only show how to make another neural network for MNIST, but do not cover the background theory. I know that polynomial regression is a concept from statistics, but I would argue most of machine learning is too. That is the reason I chose this concept as an intro to machine learning (and consequently statistics).
Thank you for your feedback. I really appreciate it. I have added a paragraph on the normal equation (should be online anytime) and why you might/might not want to use it.
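For anyone following along, here is a minimal sketch of the two approaches being compared, assuming NumPy. The data, the degree, the learning rate and the step count are made up for illustration; this is not the code from the post.

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def fit_normal_equation(x, y, degree):
    # Solve (A^T A) w = A^T y directly.
    A = design_matrix(x, degree)
    return np.linalg.solve(A.T @ A, A.T @ y)

def fit_gradient_descent(x, y, degree, lr=0.1, steps=5000):
    # Minimise the mean squared error with plain batch gradient descent.
    A = design_matrix(x, degree)
    w = np.zeros(degree + 1)
    for _ in range(steps):
        grad = 2.0 / len(y) * A.T @ (A @ w - y)
        w -= lr * grad
    return w

# Hypothetical noise-free data from y = 1 - 2x + 3x^2 on [-1, 1].
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 3.0 * x**2

w_ne = fit_normal_equation(x, y, degree=2)
w_gd = fit_gradient_descent(x, y, degree=2)
print(w_ne, w_gd)
```

On a small, well-scaled problem like this both routes should land on the same coefficients; the direct solve gets there in one step, while gradient descent needs a step size and an iteration count to be chosen.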
"Machine Learning" these days is people rediscovering statistics.
It's always amusing how underappreciated stats is. I was at a career fair the other day talking to a manager at HSBC, and he was surprised to learn that a stats curriculum includes stochastic calculus...
I'm surprised too. Do you actually mean that most stats majors learn stochastic calculus or just that it's possible to take stochastic calculus?
As a pure mathematician, my experience is that stochastic calculus was only taught to very upper-level undergraduates or graduate students because of its measure theory prerequisite. And given the number of maths majors that graduate without ever having taken a course in measure theory, I would be surprised if many stats majors actually learned stochastic calculus. Unless you guys did what the physicists do and just taught the theorems without going too deeply into the foundations, I guess?
I should have specified that this took place in France. Engineering students are first introduced to probability during the CPGE (an intensive 2- or 3-year track in math, physics, and economics). That introduction is based on measure theory (sigma algebras, the Lebesgue integral, etc.). Once you are at a statistics engineering school, all of the stats courses are based on probability and measure theory. Then you get into martingales and Markov chains, and the rest of the stochastic processes follow along.
I think that many people don't realize that statistics is heavily based on probability, so basically anything "stochastic" shouldn't be that far-fetched.
I think you have it backwards: this is saying "hey, polynomial regression isn't fancy, so by standing on it you can see what machine learning is all about."
We had an American Statistical Association workshop for the local Southern California chapter recently. The speaker mentioned that he often has to change the Wikipedia page for logistic regression back from "machine learning" to "statistical technique" because people keep changing it to ML.
At the same time, the ASA has released its 2019 report about data science and the crossroads statistics is at. It's an interesting intersection that ML/data science and statistics are at right now.
The report also gives the statisticians' version of the definition of statistics vs. machine learning and what each field emphasizes. I think people are often just really unclear on the statistics and ML fields because they have never sat down to think about what defines each field. It also ends up getting political, with people being defensive. But I think at least this report gives a statistics viewpoint.
Isn’t John D Cook arguing for iterative methods there? It does not read as a defense of “using the normal equations” or direct methods. In fact, the reference to sparsity makes you think John may be referring to SGD or a similar iterative method.
Obviously, any solution you get solves the normal equation, so it’s unclear what the parent comment meant by saying that using the normal equation is superior to SGD.
Stochastic gradient descent only approximates the solution, and it may do that by looking at a single equation (or a small batch of them) at a time, without ever bothering with the whole system.
Is it approximating the gradient or approximating the solution?
If it’s the gradient, isn’t it equal to the real gradient in expectation, so you can run as many iterations as you need to reach epsilon precision, up to what your machine will support?
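To make the "gradient vs. solution" question concrete, here is a rough sketch, assuming NumPy and a made-up noise-free system (so an exact solution exists). It checks that the single-row gradient, averaged over all rows, equals the full gradient, and then runs SGD one equation at a time. The step size and iteration count are arbitrary choices for this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overdetermined system A w = b with a known exact solution.
A = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 3.0])
b = A @ w_true

def full_gradient(w):
    # Gradient of the mean squared residual over all rows.
    return 2.0 * A.T @ (A @ w - b) / len(b)

def row_gradient(w, i):
    # Gradient of the squared residual of row i alone.
    return 2.0 * A[i] * (A[i] @ w - b[i])

# Averaging the single-row gradients over all rows recovers the full
# gradient, so a uniformly sampled row gives an unbiased estimate of it.
w0 = np.zeros(3)
mean_of_rows = np.mean([row_gradient(w0, i) for i in range(len(b))], axis=0)

# SGD: one randomly chosen equation per step, constant step size.
w = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(b))
    w -= 0.01 * row_gradient(w, i)

print(np.abs(mean_of_rows - full_gradient(w0)).max(), np.abs(w - w_true).max())
```

So each step uses an approximate gradient, and the iterate is an approximate solution that improves with more steps. With noisy data a constant step size generally stalls at a noise floor, and the step size has to shrink over time to keep improving.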
It is just "what you do". If it is a small problem, the default is a QR decomposition of A. If you are worried about speed, do a Cholesky decomposition of A'A. If the problem is big (usually because of a sparse A), then you do conjugate gradient (because fill-in will bite with a direct method). If it is really, really big (A can't fit in memory), then it isn't clear what the "thing to do" is. It is probably "sketching", but in ML/neural network land everyone just does SGD, which you can think of as a Monte Carlo estimate of the gradient (A for a linear problem). Maybe sketching and SGD are equivalent (or an approximation). "What you do" is based on convergence and stability characteristics.
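A small sketch of the first three options above, assuming NumPy and a made-up dense problem. The conjugate gradient routine is a hand-rolled illustration run on the normal equations A'A w = A'b, not a library call, and its iteration cap and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 4))   # hypothetical small dense problem
b = rng.normal(size=100)

# 1) QR of A: solve R w = Q^T b.
Q, R = np.linalg.qr(A)
w_qr = np.linalg.solve(R, Q.T @ b)

# 2) Cholesky of A'A: A'A = L L^T, then two solves with L and L^T.
#    np.linalg.solve is used for brevity; it does not exploit triangularity.
L = np.linalg.cholesky(A.T @ A)
w_chol = np.linalg.solve(L.T, np.linalg.solve(L, A.T @ b))

# 3) Conjugate gradient on A'A w = A'b, touching A only through products.
def cg_normal_equations(A, b, iters=50, tol=1e-20):
    w = np.zeros(A.shape[1])
    r = A.T @ b            # residual of the normal equations at w = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:       # squared residual norm small enough
            break
        Ap = A.T @ (A @ p)
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

w_cg = cg_normal_equations(A, b)
print(w_qr, w_chol, w_cg)
```

On a well-conditioned problem like this one all three should agree closely. The differences the comment is pointing at (stability of QR versus forming A'A, fill-in for sparse A) only show up on harder or larger problems.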
I have no idea what he's advocating for. Gaussian elimination? Matrix decomposition? The first doesn't work for non-square systems, and the second is still often slower than gradient descent (in particular when you don't need the exact minimum [such as in data science]).