I'm really not amused by the inflation of terms like "machine learning through polynomial regression". People have been doing polynomial regression for centuries (it is almost as old as linear regression), so how does it become a new thing under the umbrella of machine learning? Not to mention that the normal equation is vastly superior to gradient descent for solving for the coefficients of a polynomial regression. (The criticism is not directed at the OP but at the field in general.)
As others have mentioned, I did not intend to inflate anything by using the words 'machine learning.' Polynomial regression is taught as a first concept in many machine learning courses, so I figured it might be a good way to introduce newcomers to the field. Many of the posts currently available only show how to build yet another neural network for MNIST but do not cover the background theory. I know that polynomial regression is a concept from statistics, but I would argue most of machine learning is too. That is the reason I chose this concept as an intro to machine learning (and, consequently, statistics).
Thank you for your feedback, I really appreciate it. I have added a paragraph on the normal equation (it should be online shortly) and on why you might or might not want to use it.
"Machine Learning" these days is people rediscovering statistics.
It's always amusing how underappreciated statistics is. I was at a career fair the other day talking to a manager at HSBC, and he was surprised to learn that a stats curriculum includes stochastic calculus...
I'm surprised too. Do you actually mean that most stats majors learn stochastic calculus or just that it's possible to take stochastic calculus?
As a pure mathematician, my experience is that stochastic calculus was only taught to very upper-level undergraduates or graduate students because of its measure theory prerequisite. And given the number of maths majors that graduate without ever having taken a course in measure theory, I would be surprised if many stats majors actually learned stochastic calculus. Unless you guys did what the physicists do and just taught the theorems without going too deeply into the foundations, I guess?
I should have specified that this took place in France. Engineering students are first introduced to probability during the CPGE (an intensive two- or three-year track in math, physics, and economics). That introduction is based on measure theory (sigma-algebras, the Lebesgue integral, etc.). Once at a statistics engineering school, all of the stats courses build on probability and measure theory. Then you get into martingales and Markov chains, and the rest of the stochastic processes follow along.
I think that many people don't realize that statistics is heavily based on probability, so basically anything "stochastic" shouldn't be that far-fetched.
I think you have it backwards: this is saying "hey, polynomial regression isn't fancy, so by standing on it you can see what machine learning is all about."
We had an American Statistical Association workshop for the local Southern California chapter recently. The speaker mentioned that he often has to change the logistic regression Wikipedia page back to a statistical technique because people keep changing it to machine learning.
At the same time, the ASA has released its 2019 report about data science and the crossroads statistics is at. It's an interesting intersection that ML/data science and statistics find themselves at right now.
The report also gives the statisticians' version of the definition of statistics vs. machine learning and what each field emphasizes. I think people are often just unclear about the statistics and ML fields because they never sat down to think about what defines each one. It also ends up getting political, with people being defensive. But at least this report gives the statistics viewpoint.
Isn't John D. Cook arguing for iterative methods there? It does not read as a defense of using the normal equations or direct methods. In fact, the reference to sparsity makes you think John may be referring to SGD or a similar iterative method.
Obviously, any least-squares solution satisfies the normal equations, so it's unclear what the parent comment meant by saying the normal equation is superior to SGD.
Stochastic gradient descent only approximates the solution, and it may do that by looking at a single equation (or a small batch of them) at a time, without ever bothering with the whole system.
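To make that concrete, here is a tiny numpy sketch (my own illustration, not from the article): the gradient computed on a small batch is an unbiased estimate of the full least-squares gradient, so SGD can make progress without ever touching the whole system.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10_000, 3))   # a big, tall system A theta ~ b
    b = rng.normal(size=10_000)
    theta = np.zeros(3)

    # Full gradient of (1/2n)||A theta - b||^2 -- touches every row.
    full_grad = A.T @ (A @ theta - b) / len(b)

    # Minibatch gradient -- touches only 32 rows; equals full_grad in expectation.
    idx = rng.choice(len(b), size=32, replace=False)
    mini_grad = A[idx].T @ (A[idx] @ theta - b[idx]) / len(idx)

    print(full_grad)
    print(mini_grad)   # close, but only approximately equal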
Is it approximating the gradient or approximating the solution?
If it's the gradient, doesn't it have the real gradient in expectation, so you can run as many iterations as needed to get epsilon precision, up to what your machine will support?
It is just "what you do". If it is a small problem, the default is a QR decomposition of A. If you are worried about speed, do a Cholesky decomposition of A'A. If the problem is big (usually because of a sparse A), then you do conjugate gradient (because fill-in will bite with a direct method). If it is really, really big (A can't fit in memory), then it isn't clear what the "thing to do" is. It is probably "sketching", but in ML/neural-network land everyone just does SGD, which you can think of as a Monte Carlo estimate of the gradient (A'(Ax - b) for a linear problem). Maybe sketching and SGD are equivalent (or one approximates the other). "What you do" is based on convergence and stability characteristics.
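For what it's worth, here is a rough sketch of that ladder in numpy/scipy (my own translation of the comment, assuming a dense tall A; a genuinely sparse problem would use scipy.sparse matrices):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 5))
    b = rng.normal(size=100)

    # Small problem, the default: QR-based least squares.
    x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Worried about speed: Cholesky of the normal-equations matrix A'A.
    x_chol = cho_solve(cho_factor(A.T @ A), A.T @ b)

    # Big problem: conjugate gradient on the SPD system A'A x = A'b,
    # which never factors the matrix and so avoids fill-in.
    x_cg, info = cg(A.T @ A, A.T @ b)

    print(np.allclose(x_qr, x_chol), np.allclose(x_qr, x_cg, atol=1e-4))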
I have no idea what he's advocating for. Gaussian elimination? Matrix decomposition? The first doesn't work for non-square systems, and the second is often still slower than gradient descent (in particular when you don't need the exact minimum, as in data science).
Polynomial regression tends to be numerically unstable; e.g., under reasonable assumptions its normal-equations matrix is the notorious Hilbert matrix.
But one can instead take an approach via a set of polynomials orthogonal over the data points on the X axis, and then obtain the coveted coefficients of the fitted polynomial by simple orthogonal projections. This approach is much more stable numerically.
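A quick numpy illustration of both points (my own, with an arbitrary degree of 12): the Gram matrix of raw monomials sampled on [0, 1] is essentially a scaled Hilbert matrix, while a Legendre basis over the same points stays well conditioned.

    import numpy as np

    x = np.linspace(0, 1, 200)
    deg = 12

    V = np.vander(x, deg + 1)              # raw monomial design matrix
    print(np.linalg.cond(V.T @ V))         # astronomically large

    # Legendre polynomials, orthogonal on [-1, 1], evaluated on mapped points.
    L = np.polynomial.legendre.legvander(2 * x - 1, deg)
    print(np.linalg.cond(L.T @ L))         # many orders of magnitude smaller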
If for some reason you insist on the normal-equation approach, then maybe do a numerically exact solution of the system of equations and/or the matrix inversion. For this, start with rational numbers and multiply through by powers of 10 until all entries are whole numbers. Then, for sufficiently many single-precision prime numbers, solve the equations in the field of integers modulo each of those primes; the multiplicative inverse there is a special case of the extended Euclidean greatest-common-divisor algorithm. Finally, recover the exact multiple-precision answers with the Chinese remainder theorem.
For full details, see a paper by M. Newman from about 1966 in a journal of the US National Bureau of Standards, possibly titled "Solving Equations Exactly".
The main point of Newman's approach is that you get exact multiple-precision answers while nearly all the arithmetic is standard machine integer arithmetic. For that you might need a few lines of assembler.
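I haven't read the Newman paper, but here is a toy Python sketch of the idea as I understand it, using Cramer's rule so the CRT recovery stays simple (a serious implementation would pick primes against a Hadamard bound and use rational reconstruction):

    from math import prod

    def det_mod(M, p):
        # Determinant of an integer matrix modulo a prime p, by Gaussian
        # elimination with modular inverses (Fermat's little theorem).
        M = [[a % p for a in row] for row in M]
        n, det = len(M), 1
        for c in range(n):
            pivot = next((r for r in range(c, n) if M[r][c]), None)
            if pivot is None:
                return 0
            if pivot != c:
                M[c], M[pivot] = M[pivot], M[c]
                det = -det % p
            det = det * M[c][c] % p
            inv = pow(M[c][c], p - 2, p)
            for r in range(c + 1, n):
                f = M[r][c] * inv % p
                M[r] = [(a - f * b) % p for a, b in zip(M[r], M[c])]
        return det

    def crt(residues, primes):
        # Combine residues into the unique value mod prod(primes), mapped
        # to the symmetric range so negative answers come out right.
        P = prod(primes)
        x = sum(r * (P // p) * pow(P // p, p - 2, p)
                for r, p in zip(residues, primes)) % P
        return x if x <= P // 2 else x - P

    primes = [2147483647, 2147483629, 2147483587]   # word-sized primes
    A = [[4, 7], [2, 6]]
    b = [1, 3]

    detA = crt([det_mod(A, p) for p in primes], primes)
    for i in range(len(b)):
        # Cramer's rule: replace column i of A with b.
        Ai = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(A)]
        num = crt([det_mod(Ai, p) for p in primes], primes)
        print(f"x[{i}] = {num}/{detA}")   # exact rationals: -15/10 and 10/10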
There is a sometimes useful related point: even if the normal-equations matrix has no inverse, don't give up yet! The normal equations still have a solution, actually infinitely many solutions, and any of those solutions will minimize the sum of squared errors and give the same fitted values, although with different coefficients. If you want the coefficients uniquely, you may be disappointed. But if you really just want the fitted values, you are still okay!
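A two-minute numpy check of that claim (the fourth column below is deliberately the sum of the first two, so X'X is singular):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(10, 3))
    X = np.column_stack([X, X[:, 0] + X[:, 1]])    # redundant column -> singular X'X
    y = rng.normal(size=10)

    theta_a = np.linalg.pinv(X) @ y                      # minimum-norm solution
    theta_b = theta_a + np.array([1.0, 1.0, 0.0, -1.0])  # plus a null-space vector

    print(np.linalg.matrix_rank(X.T @ X))                # 3, not 4
    print(np.allclose(X @ theta_a, X @ theta_b))         # True: identical fitted values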
Once people understand polynomial regression and why high-order polynomial regression often does not work, they can be introduced to splines, one of the most commonly used nonparametric regression methods, which are just piecewise polynomials.
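For anyone curious what that progression looks like in code, here is a small scipy sketch (my own toy data, not from the article) comparing a degree-15 polynomial fit with a cubic smoothing spline on the same noisy samples:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

    poly = np.polynomial.Polynomial.fit(x, y, deg=15)   # high order: wiggly fit
    spline = UnivariateSpline(x, y, k=3, s=0.5)         # piecewise cubic; s tuned
                                                        # to the noise level

    # Compare max errors against the noiseless truth; the high-order
    # polynomial is often much worse near the ends of the interval.
    xs = np.linspace(0, 1, 200)
    truth = np.sin(2 * np.pi * xs)
    print(np.max(np.abs(poly(xs) - truth)))
    print(np.max(np.abs(spline(xs) - truth)))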
This is a fun blog post, but I thought it was a little hard to follow. A few observations:
    When we added the features a new problem emerged: their
    ranges are very different from X1, meaning that a small
    change in θ2, θ3, θ4 has much bigger impacts than
    changing θ1. This causes problems when we are fitting the
    values θ later on.
This was a little confusing because you reference θ2, θ3, θ4 without explicitly showing them in h(x).
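To illustrate the quoted problem with made-up numbers (these names are mine, not the post's): raw polynomial features x, x^2, x^3, x^4 have wildly different ranges, which standardizing fixes.

    import numpy as np

    x = np.linspace(0, 10, 100)
    X = np.column_stack([x, x**2, x**3, x**4])

    print(X.max(axis=0))            # ranges differ by orders of magnitude

    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    print(X_scaled.std(axis=0))     # all ones: each theta_j now has comparable leverage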
    Because we will be using the hypothesis function many
    times in the future it should be very fast. Right now h
    can only compute the prediction for one training
    example at a time. We can change that by vectorizing it.
What does it mean for _h_ to compute something? Why is vectorizing better? Context about the computation is needed to determine whether vectorizing will speed it up.
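My reading (names are my guesses, not the post's actual code) is that h is the hypothesis/prediction function, and vectorizing means replacing a per-example Python loop with one matrix-vector product:

    import numpy as np

    theta = np.array([0.5, -1.0, 2.0])

    def h_single(x, theta):
        # one training example: x is a feature vector (bias term included)
        return x @ theta

    def h_vectorized(X, theta):
        # all training examples at once: one row of X per example
        return X @ theta

    X = np.random.default_rng(4).normal(size=(1000, 3))
    slow = np.array([h_single(x, theta) for x in X])   # 1000 Python-level calls
    fast = h_vectorized(X, theta)                      # a single BLAS call

    print(np.allclose(slow, fast))   # identical predictions, far less overhead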
Why do you use gradient descent when you can use a closed-form solution to solve the regression? It would be nice to discuss both gradient descent and the closed-form solution.
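For comparison, a minimal sketch of the two options on a toy linear least-squares problem (the learning rate and iteration count are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

    # Closed form: solve the least-squares problem directly (QR under the hood).
    theta_cf, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Batch gradient descent on the same objective.
    theta_gd = np.zeros(4)
    lr = 0.01
    for _ in range(5000):
        grad = X.T @ (X @ theta_gd - y) / len(y)
        theta_gd -= lr * grad

    print(np.allclose(theta_cf, theta_gd, atol=1e-4))   # both land on the same minimizer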
You cover a lot of topics in this blog post which have a lot of nuance and depth (e.g. random initial weights) that merit whole posts on their own.
You are completely right, and I have updated the post (it should be online within a few minutes).
> You cover a lot of topics in this blog post which have a lot of nuance and depth (e.g. random initial weights) that merit whole posts on their own.
I agree the post is quite long. The reason is that I wrote the initial version for Google Code-in, a programming competition for high schoolers. It had to cover a list of concepts, and I wanted to explain them well instead of just giving a quick introduction, so it ended up being quite long.
It would definitely be interesting to write another article on symmetry breaking some time.