138 points by Jasamba on March 8, 2016 | 12 comments

 This is an excellent book written by Michael Nielsen. I had tried to learn neural nets earlier, but either found the descriptions too simplistic to come up with a clear understanding, or too complicated. IMHO, in this book, the right learning curve exists, and dare I say, trains up our biological neurons at the optimal learning rate :)
 Michael has written a great book. I humbly propose that readers who would like another intro, and who learn better from conceiving of the same subject in different ways, check out our introduction to neural nets: http://deeplearning4j.org/neuralnet-overview.html
 For C fans, I ported the Python example from this excellent book here: https://github.com/dougszumski/NNet
 Highly recommended! I was able to get a pretty good understanding of Deep Learning using this book. The book contains really good analogies (like explaining how a basic neuron works using a real life decision model) to help you understand the complex topic of Deep Learning.
 The book is excellent, but I have some trouble understanding gradient descent (mostly the calculus equations). Can anyone help me with a rudimentary, simpler explanation for this? (A link would also do.)
 >Can anyone help me with a rudimentary, simpler explanation for this? Imagine you're in a hilly field and you need to get to the lowest point because some high winds are coming; the lower the better. Normally you could just look across the whole field and see which is the lowest point, but it's pitch black outside and you can't see anything. Now hills can slope East-West and North-South. You need to move your hand around your immediate area and feel where the hill seems to be going downward both East-West and North-South. When you've found the direction that slopes downward along both axes, you can move that way and get lower than you were before. You keep repeating this process until you reach a point where feeling around you only seems to take you up again. Closer to the math: your height on the hill is your "loss function" (i.e. how wrong you are), and East-West and North-South are two variables, x and y. Feeling which way is down in your immediate area for both EW and NS is taking the partial derivatives (the backwards 6 thingie) of the cost function (your height up the hill) with respect to x and y. In our hill example we're trying to find ideal physical coordinates; in a NN we're trying to find the ideal weights. Just like in our metaphor, if the hills are big and smooth you'll find the lowest possible point quite easily. If the hills are very bumpy it can be quite hard to find the lowest point. Unlike the hill example, our problem in machine learning likely has hundreds or thousands of dimensions we have to "feel" in before we can move, rather than 2. This example is different from "hill climbing" algorithms, because in hill climbing you would move either EW or NS and see if you were lower than before. But hill climbing doesn't take into account that you might be going down EW but actually moving upwards more slowly NS. The way the entire surface seems to be sloping when you put your hand on it is essentially the gradient.
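 For anyone who prefers code to hills, here is a minimal sketch of the analogy above. The `height` function is a made-up smooth bowl (not from the book), and `feel_slope`, `descend`, the step size, and the iteration count are all illustrative choices. "Feeling around" is approximated with small finite differences instead of exact calculus.

```python
def height(x, y):
    # Hypothetical smooth "hill": a bowl whose lowest point is at (3, -1).
    return (x - 3) ** 2 + (y + 1) ** 2


def feel_slope(f, x, y, eps=1e-6):
    # Estimate the two partial derivatives by nudging a tiny amount
    # East-West (x) and North-South (y) and comparing heights.
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy


def descend(f, x, y, step=0.1, iters=200):
    # Repeatedly move a little in the direction that goes downhill
    # along both axes at once (the negative gradient).
    for _ in range(iters):
        dfdx, dfdy = feel_slope(f, x, y)
        x -= step * dfdx
        y -= step * dfdy
    return x, y


x, y = descend(height, 0.0, 0.0)
print(x, y)  # ends up very close to the bottom of the bowl at (3, -1)
```

 On a bumpy surface the same loop can stall in a dip that is not the lowest point, which is the "bumpy hills" caveat above.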
 I spent some time last year learning about the math behind neural networks and summarized it in this step by step post that you might find helpful [1]. The post has been on HackerNews in the past and is the top Google result for "backpropagation".
 If you haven't already, you should check out the first and second week of Andrew Ng's Coursera Course on Machine learning. He exclusively talks about gradient descent the first few weeks. https://www.coursera.org/learn/machine-learning
 Second the motion. Ng really explains gradient descent very well in that course. As far as the equations go, if you don't know multi-variable calculus, you might not be able to follow the actual derivations, but I don't think that's all that crucial, depending on what your goals are. Certainly you can apply this stuff without knowing the calculus behind it. And in the Ng course, he gives you all the derivations you need to implement gradient descent for various purposes. Anyway, here's my quick and dirty, way too high level overview of the whole calc business: All you're really trying to do is optimize (minimize) a function. Given a point on the graph of that function, you need to know which direction to move in order to get a smaller (more minimal) output. To do that, you calculate the slope at that point. Calculating the slope at a point on a curve is exactly what calculus does for you. If you were working with only one variable, the derivations would be trivial, but once you get into higher dimensional spaces and the need for partial derivatives, that's where the calculus gets a little trickier. But in concept, you're always just doing the same thing... calculating the slope so you know where to move, and by how much (the steeper the slope, the bigger the hop you make in a given iteration).
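 In symbols, the "calculate the slope, then hop" idea above is usually written as the update below. The notation is an assumption on my part rather than something from this thread: $C$ is the function being minimized, $w$ is the current point (or one weight), and $\eta$ is a small positive step size, often called the learning rate.

```latex
% One variable: move against the slope, scaled by the step size.
w \leftarrow w - \eta \, \frac{dC}{dw}

% Many variables: the same rule applied to each w_i using partial derivatives.
w_i \leftarrow w_i - \eta \, \frac{\partial C}{\partial w_i}
```

 Because the hop is $\eta$ times the slope, a steeper slope gives a bigger hop, and the hops shrink as the surface flattens out near a minimum.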
 Gradient descent seeks the lowest prediction error when compared with the training answer. A 1D line generally has an easy gradient downhill (x is input, y is error). Each input is a dimension. In 2, 3, or more dimensions, downhill can be subtle. Calculus is used to find the slope at a multidimensional point. You descend in the direction the calculus proposes, which is downhill toward the lower error. The error is a scalar; it is chopped up proportionally by size of contribution. Each input is weighted, and the weights are adjusted to move downhill. You are trying to minimize the error in input space. To attribute blame for the error so it can be minimized, the calculus chain rule passes the buck back to the inputs: this is back propagation. Multilayer networks do this multiple times: pass the error back, chopped up according to which input made the mistake (proportionally). So the end result is a series of adjustments to all the neural weights which will reduce this error. Rinse and repeat many, many times and you will get close to a generalised solution. Local minima used to be a worry, but in practice the more dimensions there are, the less the local minima differ from the global error minimum.
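 As a rough illustration of "passing the buck back" through more than one layer, here is a minimal sketch with a deliberately tiny network: one input, one hidden sigmoid neuron, one output sigmoid neuron. Everything specific here (the name `backprop_step`, the starting weights, the learning rate, the squared-error loss, the lack of bias terms) is an assumption chosen for brevity, not the book's code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_step(w1, w2, x, target, lr=0.5):
    # Forward pass: input -> hidden neuron -> output neuron.
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    error = 0.5 * (y - target) ** 2

    # Backward pass (chain rule), starting from the output error.
    # d(error)/d(output pre-activation):
    delta_out = (y - target) * y * (1.0 - y)
    # Blame assigned to the output weight is proportional to its input h.
    grad_w2 = delta_out * h
    # Pass the error back through w2 and the hidden sigmoid.
    delta_hidden = delta_out * w2 * h * (1.0 - h)
    # Blame assigned to the hidden weight is proportional to its input x.
    grad_w1 = delta_hidden * x

    # Adjust both weights a little in the downhill direction.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, error


w1, w2 = 0.5, -0.5
errors = []
for _ in range(200):
    w1, w2, err = backprop_step(w1, w2, x=1.0, target=1.0)
    errors.append(err)

print(errors[0], errors[-1])  # the error after training is lower than at the start
```

 A real network does the same thing with vectors and matrices of weights, but the structure is the same: compute the error, then multiply local slopes backwards layer by layer.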
 Go downhill.
 Another freely available book on neural networks is here: http://hagan.okstate.edu/nnd.html I've started skimming over this and it looks pretty useful so far.
