
Understanding LSTM networks - michael_nielsen
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
======
reader5000
What's amazing to me is that, if I understand correctly, backprop still works.
It is very odd that SGD on the error function for some training data is
conceptually equivalent to teaching all the gates for each hidden feature when
to open/close given the next input in a sequence.
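
For concreteness, here's roughly what I mean by "gates": a minimal numpy
sketch of one step of a standard LSTM cell (my own variable names and shapes,
not the post's notation). Every gate is just a smooth function of the current
input and the previous state, so the gradient flows through the gate weights
the same as through any other parameter:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W: (4*hidden, input_dim + hidden), b: (4*hidden,)
        z = W @ np.concatenate([x, h_prev]) + b
        f, i, o, g = np.split(z, 4)
        f = sigmoid(f)           # forget gate: how much old cell state to keep
        i = sigmoid(i)           # input gate: how much of the candidate to write
        o = sigmoid(o)           # output gate: how much of the cell state to expose
        g = np.tanh(g)           # candidate cell state
        c = f * c_prev + i * g   # new cell state
        h = o * np.tanh(c)       # new hidden state
        return h, c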

~~~
colah3
> What's amazing to me is that, if I understand correctly, backprop still works.

Yep! One computes the gradient with backpropagation and trains LSTMs on that.

> It is very odd that SGD on the error function for some training data is
> conceptually equivalent to teaching all the gates for each hidden feature
> when to open/close given the next input in a sequence.

Agreed, it's pretty remarkable the things one can learn with gradient descent.
I'd like to understand this better.

~~~
hyperbovine
I don't understand why this is remarkable; could you elaborate? You have a
smooth map from inputs to outputs, you differentiate it via automated
applications of the chain rule, and you use this to make the error function
smaller. Why should this be remarkable?
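
For what it's worth, the whole mechanism fits in a few lines. A toy sketch
with made-up shapes and data (nothing from the post): a smooth two-layer map,
the chain rule applied by hand, and a step that makes the error smaller:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 3))        # inputs
    y = rng.normal(size=(32, 1))        # targets
    W1 = 0.1 * rng.normal(size=(3, 8))
    W2 = 0.1 * rng.normal(size=(8, 1))

    lr = 0.1
    for step in range(200):
        # forward: smooth map from inputs to outputs
        a = np.tanh(X @ W1)
        out = a @ W2
        err = out - y
        loss = 0.5 * np.mean(err ** 2)

        # backward: chain rule, layer by layer
        d_out = err / len(X)
        d_W2 = a.T @ d_out
        d_a = d_out @ W2.T
        d_W1 = X.T @ (d_a * (1 - a ** 2))   # tanh'(z) = 1 - tanh(z)**2

        # make the error function smaller
        W1 -= lr * d_W1
        W2 -= lr * d_W2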

~~~
colah3
If I didn't know that neural nets worked, and someone explained the idea, I'd
strongly expect them to get stuck in local minima, instead of learning this
elaborate behavior. (That's basically what happens with a normal RNN, I
think.)

It's probably a case of my intuitions about optimization being really broken
in high-dimensional spaces, but I'd like to improve them. I'd also like to
understand how the cost surface we're optimizing on interacts with network
architecture decisions. Both, unfortunately, are very hard problems.

~~~
nabla9
> If I didn't know that neural nets worked, and someone explained the idea, I'd
> strongly expect them to get stuck in local minima

That was the general consensus (based on intuition) until recently. It turns
out not to be as big a problem as thought: almost all local minima have very
similar error values. The bigger issue is the combinatorially large number of
saddle points, where the gradient becomes zero with very few downward-curving
directions. The saddle-free Newton method is one way around that (rough sketch
below).

Open Problem: The landscape of the loss surfaces of multilayer networks.
Choromanska, LeCun, Ben Arous.
http://jmlr.org/proceedings/papers/v40/Choromanska15.pdf
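
Roughly, the saddle-free Newton idea is to rescale the gradient by the
absolute value of the local curvature, so negative-curvature directions at a
saddle are descended instead of attracting the iterate. A toy sketch of the
exact-Hessian step as I understand it (Dauphin et al. 2014, if I'm remembering
the right paper; real implementations approximate this for large networks):

    import numpy as np

    def saddle_free_newton_step(grad, hessian, damping=1e-3):
        # Plain Newton divides the gradient by H, which attracts the iterate
        # to saddle points (negative-curvature directions get their sign
        # flipped). Dividing by |H| keeps every direction pointing downhill.
        eigvals, eigvecs = np.linalg.eigh(hessian)
        inv_abs = 1.0 / (np.abs(eigvals) + damping)
        return -eigvecs @ (inv_abs * (eigvecs.T @ grad))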

------
ot
colah, your posts combine a deep level of understanding with exceptional
clarity. These are both rare, especially in the cargo-cult-driven world of
neural networks and deep learning.

I hope you keep writing as much as you can. Thanks!

------
d136o
Thanks colah, that was a very readable walk-through. I've been making my way
through Bishop's PRML ch 5 to get as much of a handle as possible on NNs, but
your intro here to LSTMs makes me want to jump ahead and skip to the new
stuff :)

------
p1esk
Michael, this post nicely completes your book about neural networks. I was a
little surprised you didn't write it yourself.

------
ambicapter
Anybody know what he uses for his diagrams?

~~~
colah3
All of the diagrams in this post were made in Inkscape, with the LaTeX plugin
for equations.

------
mistermaster
Great explanation, many thanks! Hochreiter is a genius!

~~~
colah3
> Hochreiter is a genius!

It's really impressive to me how farsighted his work on LSTMs was.

In any case, I'm glad you liked the post.

