Thank you for your feedback. I really appreciate it. I have added a paragraph on the normal equation (it should be online shortly) and why you might or might not want to use it.
It's always amusing how underappreciated stats is. I was at a career fair the other day talking to a manager at HSBC; he was surprised to learn that a stats curriculum includes stochastic calculus...
As a pure mathematician, my experience is that stochastic calculus was only taught to very upper-level undergraduates or graduate students because of its measure theory prerequisite. And given the number of maths majors that graduate without ever having taken a course in measure theory, I would be surprised if many stats majors actually learned stochastic calculus. Unless you guys did what the physicists do and just taught the theorems without going too deeply into the foundations, I guess?
I think that many people don't realize that Statistics is heavily based on probability, so basically anything "stochastic" shouldn't be that far-fetched.
At the same time, the ASA has released its 2019 report on data science and the crossroads statistics is at. It's an interesting intersection that ML/data science and statistics have reached at this point.
The report also gives the statistics community's version of the definition of statistics vs. machine learning and what each field emphasizes. I think people are often just unclear about the statistics and ML fields because they never sat down to think about what defines each one. It also ends up getting political, with people being defensive. But I think this report at least gives the statistics viewpoint.
e.g., an introduction to space-shuttle flying through a flight simulator on a gaming console.
Completely depends on the size of your data set. I would not want to pseudo-invert a design matrix with more than 1000 samples.
Obviously, any solution you get solves the normal equation, so it’s unclear what the parent comment meant by saying that using the normal equation is superior to SGD.
If it’s the gradient, isn’t it equal to the real gradient in expectation, so you can run as many iterations as needed to get epsilon precision, up to what your machine will support?
I’m not too familiar with CG.
What clues you in that he’s thinking more CG than SGD?
But you can take an approach via a set of polynomials orthogonal over the data points on the X axis, and then get the coveted coefficients of the fitted polynomial by just doing orthogonal projections. This approach is much more numerically stable.
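To make the orthogonal-projection idea concrete, here is a minimal sketch in Python/NumPy (the toy data and variable names are mine, not from the post): the Vandermonde matrix of the x values is orthogonalized with QR, the projection coefficients are plain inner products, and the ordinary polynomial coefficients can be recovered with a small back-substitution.

```python
import numpy as np

# Sketch: fit a degree-d polynomial by projecting onto a basis that is
# orthogonal over the observed x values, instead of forming X'X.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 - x + 3 * x**2 + 0.1 * rng.standard_normal(x.size)
d = 2

V = np.vander(x, d + 1, increasing=True)   # columns 1, x, x^2, ...
Q, R = np.linalg.qr(V)                     # Q's columns are orthonormal over the data points

proj = Q.T @ y                             # coefficients are plain orthogonal projections
fitted = Q @ proj                          # fitted values, no (V'V)^{-1} anywhere
coeffs = np.linalg.solve(R, proj)          # back out ordinary polynomial coefficients if wanted
print(coeffs)                              # should be close to [2, -1, 3]
```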
If for some reason you want to insist on the normal-equation approach, then maybe do a numerically exact solution of the system of equations and/or the matrix inversion: For this, of course, start with rational numbers. Multiply through by powers of 10 until you have all whole numbers. Then, for sufficiently many single-precision prime numbers, solve the equations in the field of integers modulo each of those primes. The multiplicative inverse there is a special case of the extended Euclidean greatest common divisor algorithm. Then, for the final answers in multiple precision, use the Chinese remainder theorem.
For full details, see a paper by M. Newman from about 1966 in a journal of the US National Bureau of Standards, possibly titled "Solving Equations Exactly".
The main point of Newman's approach is that you get exact, multiple-precision answers while nearly all of the arithmetic is standard machine integer arithmetic. For that you might need a few lines of assembler.
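As a rough illustration of the residue-arithmetic idea (not Newman's exact algorithm; the primes, helper names, and toy system below are mine), here is a sketch that solves an integer system modulo several primes and recombines the answers with the Chinese remainder theorem:

```python
from fractions import Fraction

def solve_mod(A, b, p):
    """Gauss-Jordan over Z/p; returns (det(A) mod p, solution of Ax=b mod p)."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [b[i] % p] for i in range(n)]
    det = 1
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c])   # assumes det(A) != 0 mod p
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det % p
        det = det * M[c][c] % p
        inv = pow(M[c][c], -1, p)          # modular inverse (Python 3.8+)
        M[c] = [v * inv % p for v in M[c]]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c]
                M[r] = [(M[r][j] - f * M[c][j]) % p for j in range(n + 1)]
    return det, [M[i][n] for i in range(n)]

def crt(r1, m1, r2, m2):
    """Combine x = r1 (mod m1) and x = r2 (mod m2) into x mod m1*m2."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    return r1 + m1 * t

def solve_exact(A, b, primes):
    """Exact rational solution of Ax=b with integer entries.

    The product of the primes must exceed twice the largest magnitude among
    det(A) and the entries of det(A)*x, otherwise the reconstruction is wrong.
    """
    n = len(A)
    det_res, y_res, modulus = 0, [0] * n, 1
    for p in primes:
        d, x = solve_mod(A, b, p)
        y = [xi * d % p for xi in x]       # det(A) * x has integer entries (Cramer's rule)
        det_res = crt(det_res, modulus, d, p)
        y_res = [crt(y_res[i], modulus, y[i], p) for i in range(n)]
        modulus *= p

    def center(v):                         # map residue into the symmetric range
        return v - modulus if v > modulus // 2 else v

    det = center(det_res)
    return [Fraction(center(v), det) for v in y_res]

# Tiny demo: a 2x2 system whose exact solution is [1/3, 1/2].
A = [[3, 0], [0, 2]]
b = [1, 1]
print(solve_exact(A, b, [10007, 10009, 10037]))   # -> [Fraction(1, 3), Fraction(1, 2)]
```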
There is a sometimes useful related point: even if the normal-equations matrix has no inverse, don't give up yet! The normal equations still have a solution, actually infinitely many solutions, and any of those solutions will minimize the sum of the squared errors and give the same fitted values, although with different coefficients. If you want the coefficients to be uniquely determined, you may be disappointed. But if you really just want the fitted values, you are still okay!
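A quick sketch of that point, with a toy rank-deficient design matrix of my own choosing: two different solutions of the normal equations disagree on the coefficients but agree exactly on the fitted values.

```python
import numpy as np

# A design matrix with a duplicated column, so X'X is singular and the
# normal equations have infinitely many solutions.
rng = np.random.default_rng(1)
x = rng.standard_normal(20)
X = np.column_stack([np.ones_like(x), x, x])            # third column repeats the second
y = 1 + 2 * x + 0.1 * rng.standard_normal(20)

beta_min_norm = np.linalg.lstsq(X, y, rcond=None)[0]    # minimum-norm solution
beta_other = beta_min_norm + np.array([0.0, 5.0, -5.0]) # also solves the normal equations

print(beta_min_norm, beta_other)                        # different coefficients...
print(np.allclose(X @ beta_min_norm, X @ beta_other))   # ...same fitted values: True
```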
When we added the features, a new problem emerged: their ranges are very different from X1, meaning that a small change in θ2, θ3, θ4 has a much bigger impact than a change in θ1. This causes problems when we fit the values of θ later on.
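For illustration, a minimal sketch of mean/std feature scaling (the data and function name are mine, not the post's code), which puts every column on a comparable range before fitting θ:

```python
import numpy as np

def scale_features(X):
    # Standardize each column; keep mu and sigma to scale new inputs the same way.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# e.g. X1 in [0, 1] but the higher-order features in the thousands:
x1 = np.linspace(0, 1, 100)
X = np.column_stack([x1, x1**2 * 1e3, x1**3 * 1e4, x1**4 * 1e5])
X_scaled, mu, sigma = scale_features(X)
print(X_scaled.std(axis=0))      # every column now has unit spread
```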
Because we will be using the hypothesis function many times in the future, it should be very fast. Right now h can only compute the prediction for one training example at a time. We can change that by vectorizing it.
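As a rough sketch of what vectorizing buys (function names and data are mine), here is a per-example hypothesis next to a version that predicts for every training example with a single matrix-vector product:

```python
import numpy as np

def h_single(x_row, theta):
    # One prediction: dot product of one example's features with theta.
    return sum(t * xi for t, xi in zip(theta, x_row))

def h_vectorized(X, theta):
    # All predictions at once: one matrix-vector product.
    return X @ theta

X = np.column_stack([np.ones(5), np.arange(5.0)])   # bias column + one feature
theta = np.array([0.5, 2.0])
print([h_single(row, theta) for row in X])
print(h_vectorized(X, theta))                        # same numbers, one call
```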
Why do you use gradient descent when you can use a closed-form solution to solve the regression? It would be nice to discuss both gradient descent and the closed-form solution.
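For reference, a hedged sketch of both options on toy data of my own (not the post's code): the closed-form solution of the normal equations next to a few hundred batch gradient descent steps.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
y = X @ np.array([1.0, 3.0]) + 0.1 * rng.standard_normal(100)

# Closed form: solve the normal equations X'X theta = X'y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error.
theta_gd = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= lr * grad

print(theta_closed, theta_gd)    # both should be close to [1, 3]
```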
You cover a lot of topics in this blog post which have a lot of nuance and depth (e.g. random initial weights) that merit whole posts on their own.
You are completely right, and I have updated the post. (It should be online within a few minutes.)
> You cover a lot of topics in this blog post which have a lot of nuance and depth (e.g. random initial weights) that merit whole posts on their own.
I agree the post is quite long. The reason is that I wrote the initial version for Google Code In, a programming competition for high schoolers. It had to cover a list of concepts, and I wanted to explain them well instead of just giving a quick introduction, so it ended up being quite long.
It would definitely be interesting to write another article on symmetry breaking some time.
More intuitive figures can be found e.g. here:
There is a lot involved in framing a problem in a way that it can be solved by a computer...