This is an excellent book written by Michael Nielsen. I had tried to learn neural nets earlier, but found the descriptions either too simplistic to build a clear understanding from, or too complicated. IMHO this book gets the learning curve right, and, dare I say, trains up our biological neurons at the optimal learning rate :)
Michael has written a great book. I humbly propose that readers who would like another intro, and who learn better from conceiving of the same subject in different ways, check out our introduction to neural nets:
Highly recommended! I was able to get a pretty good understanding of Deep Learning using this book. The book contains really good analogies (like explaining how a basic neuron works using a real-life decision model) to help you understand the complex topic of Deep Learning.
>Can anyone help me with a rudimentary, simpler explanation for this?
Imagine you're in a hilly field and you need to get to the lowest point because some high winds are coming, the lower the better. Normally you could just look across the whole field and see which is the lowest point, but it's pitch black outside and you can't see anything.
Now hills can slope East-West and North-South. You need to feel around your immediate area with your hand to find where the ground seems to be going downward, both East-West and North-South. When you've found the direction that seems to slope downward both ways, you can move that way and get lower than you were before. You keep repeating this process until you reach a point where feeling around only seems to take you up again.
Closer to the math: your height on the hill is your "loss function" (i.e. how wrong you are), and East-West and North-South are two variables, x and y. Feeling which way is down in your immediate area, for both EW and NS, is taking the partial derivatives (the backwards-6 thingie) of the loss function (your height up the hill) with respect to x and y. In the hill example we're trying to find the ideal physical coordinates; in a NN we're trying to find the ideal weights.
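In symbols (this is just the standard notation, not anything from the comment above: C is the loss/height, η is how big a step you take, and ∂ is that backwards-6 partial derivative):

    \nabla C(x, y) = \left( \frac{\partial C}{\partial x}, \frac{\partial C}{\partial y} \right),
    \qquad
    (x, y) \leftarrow (x, y) - \eta \, \nabla C(x, y)

The gradient ∇C points uphill, so each update steps against it, scaled by η.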
Just like in our metaphor, if the hills are big and smooth you'll find the lowest possible point quite easily. If the hills are very bumpy, it can be quite hard to find the lowest point.
Unlike the hill example, our problem in machine learning likely has hundreds or thousands of dimensions we have to "feel" in before we can move, rather than 2. This is also different from "hill climbing" algorithms: in hill climbing you would move either EW or NS and check whether you were lower than before, but that doesn't take into account that you might be going down EW while actually moving upwards, more slowly, NS. The way the entire surface seems to be sloping when you put your hand on it is essentially the gradient.
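If it helps to see the "feel around, then step downhill" loop as code, here's a minimal sketch (the landscape function, step size, and iteration count are all made up for illustration, not taken from the book):

    def height(x, y):
        # A smooth "hilly field"; the lowest point is at (3, -2).
        return (x - 3) ** 2 + (y + 2) ** 2

    def feel_slope(f, x, y, h=1e-5):
        # "Feel" the slope East-West and North-South with tiny test steps
        # (a finite-difference estimate of the two partial derivatives).
        d_ew = (f(x + h, y) - f(x - h, y)) / (2 * h)
        d_ns = (f(x, y + h) - f(x, y - h)) / (2 * h)
        return d_ew, d_ns   # together these make up the gradient

    x, y = 0.0, 0.0         # start somewhere in the dark
    step = 0.1              # learning rate
    for _ in range(100):
        d_ew, d_ns = feel_slope(height, x, y)
        x -= step * d_ew    # move against the slope, i.e. downhill
        y -= step * d_ns
    print(x, y, height(x, y))   # ends up very close to (3, -2)

The tiny test steps in feel_slope are the "feeling around with your hand" part; the update just moves against whatever slope you felt.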
I spent some time last year learning about the math behind neural networks and summarized it in this step-by-step post that you might find helpful [1]. The post has been on HackerNews in the past and is the top Google result for "backpropagation".
If you haven't already, you should check out the first and second weeks of Andrew Ng's Coursera course on Machine Learning. He talks almost exclusively about gradient descent in the first few weeks.
https://www.coursera.org/learn/machine-learning
Second the motion. Ng really explains gradient descent very well in that course.
As far as the equations go, if you don't know multivariable calculus you might not be able to follow the actual derivations, but I don't think that's all that crucial, depending on what your goals are. Certainly you can apply this stuff without knowing the calculus behind it. And in the Ng course, he gives you all the derivations you need to implement gradient descent for various purposes.
Anyway, here's my quick and dirty, way too high level overview of the whole calc business:
All you're really trying to do is optimize (minimize) a function. Given a point on the graph of that function, you need to know which direction to move in to get a smaller (more minimal) output. To do that, you calculate the slope at that point. Calculating the slope at a point on a curve is exactly what calculus does for you. If you were working with only one variable, the derivations would be trivial, but once you get into higher-dimensional spaces and the need for partial derivatives, the calculus gets a little trickier. But in concept you're always doing the same thing: calculating the slope so you know which way to move, and by how much (the steeper the slope, the bigger the hop you make in a given iteration).
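To make that last point concrete, here's a toy one-variable version (f(x) = x^2 and the learning rate are my own made-up example, not from the course). The hop at each iteration is the learning rate times the slope, so the hops shrink as the curve flattens out near the minimum:

    def f(x):
        return x ** 2

    def slope(x):
        # Derivative of x^2 is 2x: steep far from the minimum, flat near it.
        return 2 * x

    x = 8.0
    learning_rate = 0.1
    for _ in range(10):
        hop = learning_rate * slope(x)   # steeper slope => bigger hop
        x -= hop
        print(f"x = {x:.4f}, hop = {hop:.4f}")

Each printed hop is smaller than the last as x slides toward 0, which is exactly the "steeper slope, bigger hop" behavior described above.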