
Exploring LSTMs - deafcalculus
http://blog.echen.me/2017/05/30/exploring-lstms/
======
visarga
LSTMs are both amazing and not quite good enough. They seem to be too
complicated for what they do well, and not quite complex enough for what they
can't do so well. The main limitation is that they mix structure with style,
or type with value. For example, if you teach an LSTM to add 6-digit numbers,
it won't be able to generalize to 20-digit numbers.

That's because it doesn't factorize the input into separate meaningful parts.
The next step in LSTMs will be to operate over relational graphs, so they only
have to learn the function and not the structure at the same time. That way they will
be able to generalize more between different situations and be much more
useful.

Graphs can be represented as adjacency matrices and node data as vectors. By
multiplying the node vectors with the adjacency matrix, you can do graph
computation. Recurrent graph computations are a lot like LSTMs. That's why I
think LSTMs are going to become more invariant to permutation and object
composition in the future, by using graph data representations instead of flat
Euclidean vectors, and typed data instead of untyped data. So they are going to
become strongly typed graph RNNs. With such toys we can do visual and
text-based reasoning, and physical simulation.
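
Roughly, a toy numpy sketch of what I mean by multiplying vectors with the
adjacency matrix (the graph, sizes, and weights here are made up purely for
illustration):

```python
import numpy as np

# Toy directed graph on 4 nodes: 0->1, 1->2, 2->3, 3->0.
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

H = np.random.randn(4, 8)   # one 8-dim feature vector per node
W = np.random.randn(8, 8)   # weights shared by all nodes (the "function" part)

# One recurrent step of graph computation: each node aggregates its
# neighbours' features (A @ H) and transforms them with the shared weights.
for _ in range(3):
    H = np.tanh(A @ H @ W)
```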

~~~
LukeB42
Any links to implementations?

~~~
visarga
Yes, one of the links is from DeepMind: "A simple neural network module for
relational reasoning"
[https://arxiv.org/abs/1706.01427](https://arxiv.org/abs/1706.01427)

Another one is from Thomas Kipf: "Graph Convolutional Matrix Completion"
[https://arxiv.org/abs/1706.02263](https://arxiv.org/abs/1706.02263)

~~~
LukeB42
Thanks for elucidating the contents of the DeepMind paper and for hipping me
to the Graph Convolutional Matrix Completion paper.

------
inlineint
I personally find recurrent highway networks (RHNs), as described in [1],
easier to understand and remember the formulas for than the original LSTM.
Since they are generalizations of the LSTM, if one understands RHNs, one can
understand LSTMs as just a particular case of an RHN.

Instead of handwaving about "forgetting", it is IMO better to understand the
problem of vanishing gradients and how forget gates actually help with it.
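
As a toy sketch of why (my own illustration, not the RHN paper's notation),
the whole story is in the cell-state update:

```python
import numpy as np

def cell_update(c_prev, f, i, g):
    """LSTM cell-state update: c_t = f * c_{t-1} + i * g.

    The derivative of c_t with respect to c_{t-1} is simply f (element-wise),
    so as long as the forget gate stays near 1, the gradient can flow back
    through many time steps without being squashed by a nonlinearity or
    repeatedly multiplied by a recurrent weight matrix; that is how forget
    gates mitigate vanishing gradients.
    """
    return f * c_prev + i * g
```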

And Jürgen Schmidhuber, the inventor of LSTM, is a co-author of the RHN paper.

[1] [https://arxiv.org/abs/1607.03474](https://arxiv.org/abs/1607.03474)

~~~
make3
Yes, but LSTMs are fucking everywhere in the industry so understanding them is
crucial

------
YeGoblynQueenne
In the experiment on teaching an LSTM to count, it's useful to note that the
examples it's trained on are derivations [1] from the grammar a^n b^n (with
n > 0), a classic example of a Context-Free Grammar (CFG).
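
For example (a quick sketch, not the article's actual data generation), the
training strings look like this:

```python
import random

def sample_anbn(max_n=6):
    """Sample a string from the language {a^n b^n : n > 0}."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

print([sample_anbn() for _ in range(5)])   # e.g. ['ab', 'aaabbb', ...]
```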

It's well understood that CFGs cannot be induced from examples [2], which
accounts for the fact that LSTMs cannot learn "counting" in this manner, nor
indeed can any other learning method that learns from examples.

_______________

[1] "Strings generated from"

[2] The same goes for any formal grammars other than finite ones, as in
simpler than regular.

~~~
MrQuincle
> It's well understood that CFGs can not be induced from examples

I think you mean something more specific (e.g. polynomial in a particular
sense).

+ Automatic Learning of Context-Free Grammar (Chen et al.):
[http://www.aclweb.org/anthology/O06-1004](http://www.aclweb.org/anthology/O06-1004)

+ Learning context-free grammars from structural data in polynomial time
(Sakakibara, 1990):
[http://www.sciencedirect.com/science/article/pii/03043975909...](http://www.sciencedirect.com/science/article/pii/030439759090017C)
(uses positive skeletons)

Nice overview:
[http://staff.icar.cnr.it/staff/ruffolo/public_html/progetti/...](http://staff.icar.cnr.it/staff/ruffolo/public_html/progetti/projects/07.Learning%20di%20SCFG/SCFG%20learning/index-c1.pdf)

~~~
avmich
Also
[http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313....](http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.ipynb),
considering regular languages (what regexes describe) are a subset of
context-free languages.

~~~
YeGoblynQueenne
From a quick glance, the example doesn't quite learn regular grammars -
rather, it reduces them to finite grammars (e.g. the finite grammar of all US
presidential winners and losers) and learns (or, more accurately, invents) a
regular expression for them.

Finite grammar induction from positive examples only is feasible in polynomial
time, so Peter Norvig's notebook will not cause the fabric of the space-time
continuum to be torn asunder, I am sure.

------
mrplank
LSTMs are on their way out, in my opinion. They are a hack to make memory in
recurrent networks more persistent. In practice they overfit too easily. They
are being replaced with convolutional networks. Have a look at the recent
paper from Facebook about translation for more details.

~~~
Drdrdrq
The way I see it, the difference is that with a CNN you have a fixed maximum
timeframe in which knowledge about the world is preserved, while LSTMs, and
RNNs in general, do not impose such a restriction. This makes them better
suited for some applications.

If I am missing something please correct me.

------
dirtyaura
Really great work on visualizing neurons!

Is anyone working with LSTMs in a production setting? Any tips on what the
biggest challenges are?

Jeremy Howard said in the fast.ai course that in applied settings, simpler GRUs
work much better and have replaced LSTMs. Comments about this?

~~~
agibsonccc
Yes, the bulk of our business is time series. This includes everything from
hardware breakdowns to fraud detection. I think Jeremy has some good points,
but I wouldn't assume that everything is binary. (By this I mean: look at
these kinds of terse statements with a bit of nuance.)

Usually, as long as you use a high amount of regularization and truncated
backprop through time in training, you can learn some fairly useful
classification and forecasting models.
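
As a rough illustration (a PyTorch-style sketch with made-up dimensions, not
our production code), truncated BPTT mostly comes down to detaching the hidden
state between chunks:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()),
                       weight_decay=1e-4)   # part of the heavy regularization

x = torch.randn(8, 1000, 16)   # (batch, long sequence, features)
y = torch.randn(8, 1000, 1)
state = None

for t in range(0, 1000, 50):                  # 50-step truncation window
    chunk, target = x[:, t:t + 50], y[:, t:t + 50]
    out, state = lstm(chunk, state)
    loss = nn.functional.mse_loss(head(out), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = tuple(s.detach() for s in state)  # cut the graph: truncated BPTT
```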

Beyond that, standard neural net tuning applies. E.g.: normalize your data,
pay attention to your weight initialization, understand what loss function
you're using, etc.

~~~
griffinkelly
So if LSTMs are purposely forgetting, do you need less training data than a
CNN?

~~~
agibsonccc
LSTMs don't so much "forget" as "remember the things that matter". They don't
necessarily need less data. They do have a limit on the number of time steps
they can handle, though. E.g.: you can't do thousands of steps into the future
(maybe a few hundred or so).

The "long" part of "LSTM" means remembering _good_ long-range dependencies.

------
minimaxir
Is there code for the coloring of neurons per-character as in the post? I've
seen that type of visualization on similar posts and am curious if there is a
library for it. (The original char-rnn post
[[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)]
indicates that it is custom code/CSS/HTML.)

~~~
trwoway
From Karpathy's blog post:
[http://cs.stanford.edu/people/karpathy/viscode.zip](http://cs.stanford.edu/people/karpathy/viscode.zip)

------
mrplank
Google Brain outperforms LSTMs with attention-based networks (the Transformer,
which uses no recurrence or convolutions) in speed and accuracy, seeming to
confirm that LSTMs are not optimal for NLP at least:

[https://arxiv.org/pdf/1706.03762.pdf](https://arxiv.org/pdf/1706.03762.pdf)

------
Seanny123
Is the code for generating the reactions from the LSTM hidden units posted
anywhere? That was the best part for me and I'd love to use it in my own
projects.

~~~
Joris1225
In Andrej Karpathy's excellent blog post on RNNs [1], he links to some of the
code he used for visualisation [2]. I have done something similar. In general:
put the activations of each memory cell over time through tanh and color each
character based on that.

[1] [http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
[2] [http://cs.stanford.edu/people/karpathy/viscode.zip](http://cs.stanford.edu/people/karpathy/viscode.zip)
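
Roughly something like this (my own sketch, not Karpathy's viscode):

```python
import math

def colorize(chars, activations):
    """Color each character by tanh of one cell's activation at that step."""
    spans = []
    for ch, act in zip(chars, activations):
        v = math.tanh(act)              # squash to [-1, 1]
        r = int(255 * max(v, 0.0))      # positive activation -> red
        b = int(255 * max(-v, 0.0))     # negative activation -> blue
        spans.append(f'<span style="background: rgba({r},0,{b},0.4)">{ch}</span>')
    return "".join(spans)

html = colorize("hello", [0.9, -0.3, 0.1, -2.0, 1.5])
```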

------
CyberDildonics
> I once thought LSTMs were tricky, but LSTMs are actually very easy ...

You would think an article like this would define LSTM somewhere.

------
natch
LSTM is "Long Short Term Memory," since the tutorial never mentions what it
stands for.

[https://en.wikipedia.org/wiki/Long_short-term_memory](https://en.wikipedia.org/wiki/Long_short-term_memory)

~~~
jeffjose
It's defined a few sections in, but yes, your larger point stands.

------
raarts
Can someone provide a tl;dr ?

~~~
avaer
Neural nets that update over time need a way to know what to keep and what to
forget from last time. So let's learn what to keep and forget... by using more
neural nets.
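
In code, that roughly looks like this (a schematic single-step sketch, biases
omitted, not any particular library's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step: small learned gates decide what to keep and forget."""
    z = np.concatenate([x, h])
    f = sigmoid(W["f"] @ z)        # forget gate: what to drop from memory
    i = sigmoid(W["i"] @ z)        # input gate: what new info to write
    o = sigmoid(W["o"] @ z)        # output gate: what to expose this step
    g = np.tanh(W["g"] @ z)        # candidate memory content
    c = f * c + i * g              # updated long-term cell state
    h = o * np.tanh(c)             # updated hidden state / output
    return h, c
```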

Disclaimer: I just learned what an LSTM was. But it's a good article.

~~~
SAI_Peregrinus
And LSTM stands for Long/Short Term Memory; the definition is only found about
a third of the way through the article.

~~~
thrw1001
Knowing what the actual units do, I prefer spelling out the acronym as Long
Short-Term Memory. They implement short-term memory, like all RNNs do, with a
slight improvement to make it long.

