That's because it doesn't factorize the input into separate meaningful parts. The next step for LSTMs will be to operate over relational graphs, so they only have to learn the function, not the structure at the same time. That way they will be able to generalize better across different situations and be much more useful.
Graphs can be represented as adjacency matrices and data as vectors. By multiplying a vector with such a matrix, you can do graph computation. Recurrent graph computations are a lot like LSTMs. That's why I think LSTMs are going to become more invariant to permutation and object composition in the future, by using graph data representations instead of flat Euclidean vectors, and typed data instead of untyped data. So they are going to become strongly typed graph RNNs. With such toys we can do visual and text-based reasoning, and physical simulation.
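The adjacency-matrix idea can be sketched in a few lines. The graph and features below are made-up toy values, not from any real model:

```python
# Toy illustration of "graph computation by multiplying a vector with an
# adjacency matrix": each node gathers the summed features of its neighbours.
A = [  # adjacency matrix of a 4-node path graph: edges 0-1, 1-2, 2-3
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
x = [1.0, 2.0, 3.0, 4.0]  # one scalar feature per node

def matvec(A, x):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

# One step of message passing: node i now holds the sum of its
# neighbours' features.
h = matvec(A, x)
print(h)  # [2.0, 4.0, 6.0, 3.0]

# A "recurrent graph computation" just applies this repeatedly, carrying h:
h = matvec(A, h)
```

A real graph network would interleave these aggregation steps with learned per-node transformations, but the structural part really is just this multiply.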
Here is an implementation of "A simple neural network module for relational reasoning" in PyTorch:
I'm not sure they're quite as complicated as you're making them out to be. If they are, then try a GRU instead ;)
Another one is from Thomas Kipf: "Graph Convolutional Matrix Completion" https://arxiv.org/abs/1706.02263
Instead of handwaving about "forgetting", it is IMO better to understand the problem of vanishing gradients and how forget gates can actually help with it.
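To make the vanishing-gradient point concrete, here's a toy numeric sketch. The factors 0.5 and 0.99 are made-up illustrative values, not measured from any network:

```python
# In a plain RNN, the gradient flowing back T steps picks up one
# multiplicative factor per step; if each factor is around 0.5,
# the signal is gone long before T = 100.
T = 100
plain_rnn_grad = 0.5 ** T    # ~7.9e-31: effectively zero

# The LSTM cell state's recurrence is c_t = f_t * c_{t-1} + (new input),
# so the corresponding per-step factor is the forget gate f_t itself.
# When the network learns to hold f_t near 1, the product survives:
lstm_grad = 0.99 ** T        # ~0.37: still a usable gradient

print(plain_rnn_grad, lstm_grad)
```

That's the whole trick: the forget gate gives the network a learnable, near-1 path for gradients, instead of a fixed contraction at every step.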
And Jürgen Schmidhuber, co-inventor of the LSTM, is a co-author of the RHN paper.
It's well understood that CFGs cannot be induced from examples, which accounts for the fact that LSTMs cannot learn "counting" in this manner, nor indeed can any other method that learns from examples.
 "Strings generated from"
The same goes for any formal grammar other than the finite ones, i.e. those simpler than regular.
I think you mean something more specific (e.g. polynomial in a particular sense).
+ Automatic Learning of Context-Free Grammar (Chen et al.): http://www.aclweb.org/anthology/O06-1004
+ Learning context-free grammars from structural data in polynomial time (Sakakibara, 1994): http://www.sciencedirect.com/science/article/pii/03043975909... (uses positive skeletons)
Nice overview: http://staff.icar.cnr.it/staff/ruffolo/public_html/progetti/...
However, his algorithm learns CFGs from structural data, which is to say, derivation trees (think parse trees). So it's completely irrelevant to the example in the article, which attempts to learn a^nb^n from example strings alone, and that remains impossible.
As to the other paper, by Chen, Tseng and Chen, that's about learning a CFG that reproduces the strings in a corpus, i.e. learning a CFG of a corpus, as opposed to the grammar of a context-free language (therefore, a context-free grammar), which, again, remains impossible.
Finite grammar induction from positive examples only is feasible in polynomial time, so Peter Norvig's notebook will not cause the fabric of the space-time continuum to be torn asunder, I am sure.
As an example, it is possible to induce a counting algorithm from examples with high probability: https://colala.bcs.rochester.edu/papers/piantadosi2012bootst...
That's learning the context-free language a^nb^n for n > 0 in a nutshell, which is therefore impossible to learn, and indeed the LSTM in the article falters once test-time n exceeds training-time n by a few units.
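For reference, here's what "counting" means operationally: a^nb^n is trivially recognized with a single counter, for every n, which is exactly the behaviour the trained LSTM fails to extrapolate. This is just a sketch of the target algorithm, not anything from the article:

```python
def is_anbn(s):
    """Recognize a^n b^n (n > 0) with one counter: increment on each 'a',
    decrement on each 'b', accept iff the string is exhausted at zero."""
    count = 0
    i = 0
    while i < len(s) and s[i] == 'a':
        count += 1
        i += 1
    if count == 0:
        return False          # need n > 0
    while i < len(s) and s[i] == 'b':
        count -= 1
        i += 1
    return i == len(s) and count == 0

# Works for arbitrary n, unlike a model trained on small n only:
print(is_anbn('a' * 1000 + 'b' * 1000))  # True
print(is_anbn('aabbb'))                  # False
```

The inductive leap is in the unbounded counter; a learner that only fits the training range never commits to it.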
The paper you link to discusses subject matter I'm not familiar with (bootstrapping in the psychology sense, say), so I'm not really qualified to opine on it. However, from a very quick look it seems like they're learning to associate counting words with numbers (sorry if I'm wrong about that). Still, that's not "learning to count", where you have to be able to make the inductive leap that n + 1 > n for all n in your target set of numbers, in addition to learning the recursive structure of number strings.
Finally, from the little I've seen of attempts to model counting with machine learning, there hasn't yet been much success, probably because it's impossible to learn a language with both structure and meaning from examples of its structure alone, with no access to its meaning.
If I am missing something please correct me.
Is anyone working with LSTMs in a production setting? Any tips on what are the biggest challenges?
Jeremy Howard said in the fast.ai course that in applied settings, simpler GRUs work much better and have replaced LSTMs. Comments about this?
Usually, as long as you use a good amount of regularization and truncated backprop through time in training, you can handle some fairly useful classification and forecasting problems.
Beyond that, standard neural net tuning applies: normalize your data, pay attention to your weight initialization, understand what loss function you're using, etc.
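The truncated-BPTT part is mostly bookkeeping: split the long sequence into fixed-size chunks, carry the hidden state across chunk boundaries, but stop gradients there. This is a framework-agnostic sketch; in PyTorch the gradient cut would be `hidden = hidden.detach()` at each boundary, here shown only as a comment:

```python
def tbptt_chunks(sequence, k):
    """Yield consecutive chunks of length <= k from a long sequence."""
    for start in range(0, len(sequence), k):
        yield sequence[start:start + k]

sequence = list(range(10))  # stand-in for one long training sequence
hidden = 0                  # stand-in for the RNN hidden state

for chunk in tbptt_chunks(sequence, k=3):
    # hidden = hidden.detach()  # <- gradient truncation happens here
    for x in chunk:
        hidden = hidden + x     # stand-in for one RNN step
    # loss.backward(); optimizer.step() would run here, seeing at most
    # k steps of history per update
```

The payoff is bounded memory and compute per update; the cost is that no gradient signal crosses a chunk boundary, even though the forward-pass state does.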
The "Long" part of "LSTM" refers to remembering long-range dependencies well.
I said, IIRC, that they're very similar in terms of results and GRUs are a little simpler. I've seen some papers show better results for GRUs, and vice versa.
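"A little simpler" is concrete in the equations: a GRU has two gates and one state, versus the LSTM's three gates plus a separate cell state. A scalar (1-dimensional) sketch of one GRU step, with made-up illustrative weights rather than trained values:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def gru_step(x, h, Wz=0.5, Uz=0.5, Wr=0.5, Ur=0.5, Wh=1.0, Uh=1.0):
    """One scalar GRU step.  Weights are arbitrary illustrative numbers."""
    z = sigmoid(Wz * x + Uz * h)                 # update gate
    r = sigmoid(Wr * x + Ur * h)                 # reset gate
    h_tilde = math.tanh(Wh * x + Uh * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde             # interpolate old and new

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h)
print(h)
```

Since the new state is a convex combination of the old state and a tanh output, h stays in (-1, 1); the update gate z plays roughly the role of the LSTM's forget and input gates combined.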
What does GRU stand for?
You would think an article like this would define LSTM somewhere.
Disclaimer: I just learned what an LSTM was. But it's a good article.