
An introduction to machine learning through polynomial regression - rickdeveloper
https://rickwierenga.com/blog/machine%20learning/polynomial-regression.html
======
geoalchimista
I'm really not amused by the inflation of phrases like "machine learning
through polynomial regression". People have been doing polynomial regression
for centuries (it is almost as old as linear regression), so how does it
become a new thing under the umbrella of machine learning? Not to mention
that the normal equation is vastly superior to gradient descent for solving
for the coefficients of a polynomial regression. (The criticism is not
directed at the OP but at the field in general.)
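
For what it's worth, a minimal NumPy sketch of that closed-form (normal
equation) fit; the data and degree are invented purely for illustration:

    import numpy as np

    # Toy data, made up for illustration
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

    # Design matrix with columns 1, x, x^2
    X = np.vander(x, 3, increasing=True)

    # Normal equation: theta = (X^T X)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta)          # roughly [1, 2, -3]

    # In practice np.linalg.lstsq (QR/SVD based) is the better-conditioned route
    theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)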

~~~
curiousgal
"Machine Learning" these days is people rediscovering statistics.

It's always amusing how underappreciated statistics is. I was at a career fair
the other day talking to a manager at HSBC; he was surprised to learn that a
stats curriculum includes stochastic calculus...

~~~
solveit
I'm surprised too. Do you actually mean that most stats majors learn
stochastic calculus or just that it's possible to take stochastic calculus?

As a pure mathematician, my experience is that stochastic calculus was only
taught to very upper-level undergraduates or graduate students because of its
measure theory prerequisite. And given the number of maths majors that
graduate without ever having taken a course in measure theory, I would be
surprised if many stats majors actually learned stochastic calculus. Unless
you guys did what the physicists do and just learned the theorems without
going too deeply into the foundations, I guess?

~~~
curiousgal
I should have specified that this took place in France. Engineering students
are first introduced to probability during the CPGE (an intensive math,
physics, and economics track of 2 or 3 years). That introduction is based on
measure theory (sigma algebras, the Lebesgue integral, etc.). Once you are at
a statistics engineering school, all of the stats courses are built on
probability and measure theory. Then you get into martingales and Markov
chains, and the rest of the stochastic processes follow along.

I think that many people don't realize that statistics is heavily based on
probability, so basically anything "stochastic" shouldn't be that far-fetched.

------
graycat
Polynomial regression tends to be numerically unstable; e.g., under
reasonable assumptions its normal-equation matrix is essentially the
notorious Hilbert matrix.

But one can instead work with a set of polynomials orthogonal over the data
points on the X axis, and then obtain the coveted coefficients of the fitted
polynomial by simple orthogonal projections. This approach is much more
stable numerically.
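
A rough NumPy sketch of that idea (toy data; Legendre polynomials on a
rescaled domain stand in for polynomials built to be exactly orthogonal over
the sample points, but the flavor is the same):

    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 1.0, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.shape)
    deg = 12

    # Naive route: normal equations on the raw monomial (Vandermonde) basis
    V = np.vander(x, deg + 1, increasing=True)
    print(np.linalg.cond(V.T @ V))       # enormous -- Hilbert-like conditioning

    # Orthogonal-basis route: fit in a Legendre basis on the mapped domain [-1, 1]
    t = 2.0 * x - 1.0
    c = legendre.legfit(t, y, deg)       # well-conditioned, projection-style fit
    y_fit = legendre.legval(t, c)        # fitted values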

If for some reason you want to insist on the normal equation approach, then
maybe do a numerically exact solution of the system of equations and/or the
matrix inversion. For this, of course, start with rational numbers. Multiply
through by powers of 10 until all the numbers are whole. Then, for
sufficiently many single-precision prime numbers, solve the equations in the
field of integers modulo each of those primes. The multiplicative inverse
needed there is a special case of the extended Euclidean greatest common
divisor algorithm. Finally, combine the residues into the exact,
multiple-precision answers with the Chinese remainder theorem.

For full details, see a paper by M. Newman from about 1966 in a journal of
the US National Bureau of Standards, maybe titled _Solving Equations Exactly_.

The main point of Newman's approach is that you get exact multiple-precision
answers while nearly all of the arithmetic is standard machine integer
arithmetic. For that you might need a few lines of assembler.
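
A rough Python sketch of the flavor (a Cramer's-rule variant, not necessarily
Newman's exact algorithm: each determinant is computed modulo several
word-sized primes and reassembled with the Chinese remainder theorem, so
nearly all the work is machine-word arithmetic):

    from fractions import Fraction
    from math import isqrt

    def is_prime(n):
        # trial division; fine for word-sized candidates
        return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

    def det_mod(M, p):
        # determinant of an integer matrix modulo the prime p, by Gaussian elimination
        n, A, det = len(M), [[v % p for v in row] for row in M], 1
        for col in range(n):
            pivot = next((r for r in range(col, n) if A[r][col]), None)
            if pivot is None:
                return 0
            if pivot != col:
                A[col], A[pivot] = A[pivot], A[col]
                det = -det % p
            det = det * A[col][col] % p
            inv = pow(A[col][col], -1, p)            # modular inverse
            for r in range(col + 1, n):
                f = A[r][col] * inv % p
                for c in range(col, n):
                    A[r][c] = (A[r][c] - f * A[col][c]) % p
        return det

    def crt(residues, primes):
        # combine residues with the CRT; return the value in a symmetric range
        x, M = 0, 1
        for r, p in zip(residues, primes):
            x += M * ((r - x) * pow(M % p, -1, p) % p)
            M *= p
        return x - M if x > M // 2 else x

    def solve_exact(A, b):
        # solve A x = b exactly over the rationals (A, b integer)
        n, bound = len(A), 1
        for row, bi in zip(A, b):                    # Hadamard-style size bound
            bound *= isqrt(sum(v * v for v in row) + bi * bi) + 1
        primes, prod, p = [], 1, 2**31 - 1
        while prod <= 2 * bound:                     # enough primes to recover the result
            while not is_prime(p):
                p -= 2
            primes, prod, p = primes + [p], prod * p, p - 2
        exact_det = lambda M: crt([det_mod(M, q) for q in primes], primes)
        d = exact_det(A)
        cols = [[row[:i] + [bi] + row[i + 1:] for row, bi in zip(A, b)] for i in range(n)]
        return [Fraction(exact_det(Ai), d) for Ai in cols]

    print(solve_exact([[3, 1], [1, 2]], [9, 8]))     # [Fraction(2, 1), Fraction(3, 1)]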

There is a sometimes useful related point: even if the normal equations
matrix has no inverse, don't give up yet! The normal equations still have a
solution, actually infinitely many solutions, and any of those solutions will
minimize the sum of the squared errors and give the same fitted values,
although with different coefficients. If you want the coefficients exactly,
you may be disappointed. But if you really want just the fitted values, you
are still okay!
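
A small NumPy illustration of that point (the duplicated column is contrived
just to make the normal-equations matrix singular):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=100)
    X = np.column_stack([np.ones_like(x), x, x])   # repeated column -> X^T X singular
    y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)

    # lstsq returns the minimum-norm least-squares solution (via the SVD),
    # one of infinitely many coefficient vectors minimizing the squared error
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Shifting weight between the two identical columns gives another solution
    # with exactly the same fitted values
    theta_alt = theta + np.array([0.0, 1.0, -1.0])
    print(np.allclose(X @ theta, X @ theta_alt))   # True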

------
Bostonian
Once people understand polynomial regression and why high-order polynomial
regression often does not work, they can be introduced to splines, one of the
most commonly used nonparametric regression methods, which are just piecewise
polynomials.
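
For a quick feel for the contrast, a rough SciPy sketch (the data, degree,
and smoothing factor are all made up for illustration):

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 60)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # High-order polynomial fit: prone to wild oscillations near the edges
    y_poly = np.polyval(np.polyfit(x, y, deg=15), x)

    # Cubic smoothing spline: piecewise cubics joined smoothly at knots
    spline = UnivariateSpline(x, y, k=3, s=60 * 0.2**2)
    y_spline = spline(x)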

------
BeeBoBub
This is a fun blog post, but I thought it was a little hard to follow. A few
observations:

    
    
      When we added the features a new problem emerged: their 
      ranges are very different from X1 meaning that a small 
      change in θ2, θ3, θ4 have much bigger impacts than 
      changing θ1. This causes problems when we are fitting the 
      values θ later on.
    

This was a little confusing because you reference θ2, θ3, θ4 without
explicitly showing them in h(x).
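
(The issue being quoted is the classic feature-scaling one; a tiny sketch of
the usual fix, standardizing each polynomial feature, using my own toy setup
rather than the post's code:)

    import numpy as np

    x = np.linspace(0, 10, 100)
    X = np.column_stack([x, x**2, x**3, x**4])   # x^4 reaches 10^4 while x only reaches 10

    # Standardize each feature so one gradient step affects all thetas comparably
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)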

    
    
      Because we will be using the hypothesis function many 
      times in the future it should be very fast. Right now h 
      can only compute the prediction for one training 
      example at a time. We can change that by vectorizing it
    

What does it mean for _h_ to compute something? Why is vectorizing better?
Context about the computation is needed to determine whether vectorizing will
actually speed it up.
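
(For readers following along: a minimal sketch of what vectorizing the
hypothesis usually means in NumPy; the names follow the post's notation, but
the code itself is just my guess at the idea.)

    import numpy as np

    def h_single(x_row, theta):
        # prediction for one training example; a loop would call this m times
        return float(np.dot(x_row, theta))

    def h_vectorized(X, theta):
        # predictions for all m examples at once: X is (m, n), theta is (n,)
        return X @ theta

    X = np.column_stack([np.ones(5), np.arange(5.0)])   # toy design matrix
    theta = np.array([1.0, 2.0])
    print([h_single(row, theta) for row in X])
    print(h_vectorized(X, theta))    # same numbers, one matrix product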

Why do you use gradient descent when you can use a closed-form solution to
solve the regression? It would be nice to discuss both gradient descent and
the closed-form solution.
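
Roughly, the two routes look like this (a sketch with a made-up learning rate
and iteration count, not the post's actual code):

    import numpy as np

    rng = np.random.default_rng(4)
    X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
    y = X @ np.array([0.5, -1.5]) + rng.normal(scale=0.1, size=100)

    # Closed form: the normal equation
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Batch gradient descent on the mean squared error
    theta, alpha = np.zeros(2), 0.1          # alpha is an arbitrary learning rate
    for _ in range(2000):
        theta -= alpha * X.T @ (X @ theta - y) / len(y)

    print(theta_closed, theta)               # should agree to several decimals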

You cover a lot of topics in this blog post which have a lot of nuance and
depth (e.g. random initial weights) that merit whole posts on their own.

~~~
rickdeveloper
Thank you so much for your feedback!

You are completely right, and I have updated the post. (It should be online
within a few minutes.)

> You cover a lot of topics in this blog post which have a lot of nuance and
> depth (e.g. random initial weights) that merit whole posts on their own.

I agree the post is quite long. The reason is that I wrote the initial
version for Google Code In, a programming competition for high schoolers. It
had to cover a list of concepts, and I wanted to explain them well instead of
just giving a quick introduction, so it ended up being quite long.

It would definitely be interesting to write another article on symmetry
breaking some time.

------
xrd
I'm amazed this was written by a 16 year old. Nice work.

------
amelius
The figure corresponding to overfitting doesn't seem right.

More intuitive figures can be found e.g. here:

[https://www.quora.com/What-are-the-key-trade-offs-between-
ov...](https://www.quora.com/What-are-the-key-trade-offs-between-overfitting-
and-underfitting)

------
rsrsrs86
I think the "give your data to a computer and ask it to find patterns" kinda
defeats the purpose of showing what machine learning is.

There is a lot involved in framing a problem in a way that it can be solved by
a computer...

------
salty_biscuits
Lost me at norm of y equals m...

