Can anyone help me with a rudimentary, simpler explanation for this? (a link may also do)
Imagine you're in a hilly field and you need to get to the lowest point because some high winds are coming, the lower the better. Normally you could just look across the whole field and see which is the lowest point, but it's pitch black outside and you can't see anything.
Now hills can slope East-West and North-South. You need to move your hand around your immediate area and feel where the hill seems to be going downward both East-West and North-South. When you've found the direction that seems to slope downward both ways, you can move in that direction and get lower than you were before. You keep repeating this process until you reach a point where feeling around you only seems to take you up again.
Closer to the math: your height on the hill is your "loss function" (i.e. how wrong you are), and East-West and North-South are two variables: x and y. Feeling which way is down in your immediate area for both EW and NS is taking the partial derivative (the backwards-6 symbol, ∂) of the loss function (your height up the hill) with respect to x and y. In our hill example we're trying to find the ideal physical coordinates; in a NN we're trying to find the ideal weights.
Just like in our metaphor, if the hills are big and smooth you'll find the lowest possible point quite easily. If the hills are very bumpy it can be quite hard to find the lowest point.
Unlike the hill example, our problem in machine learning likely has hundreds or thousands of dimensions we have to "feel" in before we can move, rather than 2. This is also different from "hill climbing" algorithms: in hill climbing you would move either EW or NS and check whether you were lower than before. But hill climbing doesn't take into account that you might be going down EW while actually drifting upward NS. The way the entire surface slopes when you put your hand on it is essentially the gradient.
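The feel-the-slope-in-both-directions loop above can be sketched in a few lines of Python. This is a toy bowl-shaped loss I made up for illustration, not anything from a course or library:

```python
# Toy 2D gradient descent: loss(x, y) = (x - 3)**2 + (y + 1)**2,
# a smooth "bowl" whose lowest point is at (3, -1).
def grad(x, y):
    # Partial derivatives of the loss with respect to x and y --
    # the "feel" of the slope East-West and North-South.
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0          # start somewhere in the dark field
lr = 0.1                 # step size (learning rate)
for _ in range(100):
    dx, dy = grad(x, y)
    x -= lr * dx         # step downhill in BOTH directions at once,
    y -= lr * dy         # not one axis at a time like hill climbing

print(x, y)              # ends up very close to (3, -1)
```

Because both coordinates move together, each step follows the combined slope (the gradient) rather than one compass direction at a time.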
As far as the equations go, if you don't know multivariable calculus you might not be able to follow the actual derivations, but I don't think that's all that crucial, depending on what your goals are. Certainly you can apply this stuff without knowing the calculus behind it. And in the Ng course, he gives you all the derivations you need to implement gradient descent for various purposes.
Anyway, here's my quick and dirty, way too high level overview of the whole calc business:
All you're really trying to do is optimize (minimize) a function. Given a point on the graph of that function, you need to know which direction to move in to get a smaller (more minimal) output. To do that, you calculate the slope at that point. Calculating the slope at a point on a curve is exactly what calculus does for you. If you were working with only one variable, the derivations would be trivial, but once you get into higher-dimensional spaces and the need for partial derivatives, that's where the calculus gets a little trickier. But in concept, you're always just doing the same thing... calculating the slope so you know where to move, and by how much (the steeper the slope, the bigger the hop you make in a given iteration).
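That one-variable case is worth seeing concretely. A quick sketch (my own toy function, not from the course), where the hop size is literally proportional to the slope:

```python
# Minimize f(x) = x**2 by repeatedly stepping against the slope.
# The derivative f'(x) = 2*x is the slope at x.
x = 5.0                  # starting point
lr = 0.1                 # learning rate
for _ in range(50):
    slope = 2 * x        # steeper slope -> bigger hop this iteration
    x -= lr * slope      # move downhill, proportional to the slope

print(x)                 # very close to 0, the minimum of x**2
```

Notice the steps shrink automatically as x approaches the bottom, because the slope itself shrinks there.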
A 1D line generally has an easy gradient downhill. (x is input, y is error).
Each input is a dimension.
In 2, 3, or more dimensions, downhill can be subtle. Calculus is used to find the slope at a multidimensional point.
You descend in the direction the calculus proposes, which is downhill toward the lower error.
The error is a scalar; it is chopped up in proportion to the size of each contribution.
Each input is weighted, and the weights are adjusted downhill.
You are trying to minimize the error over the weights.
To attribute blame for the error so it can be minimized, the calculus chain rule passes the buck back through the network - this is back propagation.
Multilayer networks do this multiple times: pass the error back, chopped up according to how much each input contributed to the mistake (proportionally).
So the end result is a series of adjustments to all the neural weights which will reduce this error.
Rinse and repeat many, many times and you will get close to a generalised solution.
Local minima used to be a worry, but in practice the more dimensions you have, the less local minima tend to differ from the global minimum.
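A minimal sketch of that chain-rule "pass the buck" idea, for a made-up network with one weight per layer (the numbers and names are mine, purely for illustration):

```python
# Tiny "network": input -> hidden -> output, one weight per layer.
# Forward: h = w1 * x ; y_hat = w2 * h ; error = 0.5 * (y_hat - y)**2
x, y = 1.0, 2.0          # one training example and its target
w1, w2 = 0.5, 0.5        # weights to be adjusted
lr = 0.1                 # learning rate

for _ in range(200):
    # forward pass
    h = w1 * x
    y_hat = w2 * h
    # backward pass: the chain rule splits the blame for the error
    d_err = y_hat - y    # d(error)/d(y_hat)
    d_w2 = d_err * h     # blame assigned to w2
    d_h = d_err * w2     # error passed back to the hidden layer
    d_w1 = d_h * x       # blame assigned to w1, one layer further back
    # adjust both weights downhill
    w1 -= lr * d_w1
    w2 -= lr * d_w2

print(w1 * x * w2)       # the network's output, now close to y = 2.0
```

The `d_h` line is the buck-passing step: the error at the output is translated into an error at the hidden layer, in proportion to the weight that carried it forward.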
I've started skimming over this and it looks pretty useful so far.