For example, consider that the RNN in Karpathy's post (the one linked to in this article) was capable of generating well-formed XML, producing matching opening and closing tags with an apparently unbounded amount of material between them. No Markov chain can do this.
The other question is: are those difficult-to-learn things truly worth the cost of training and running an RNN? If a fast and simple Markov chain serves, as is likely the case in practical settings, then it is better to go with the Markov chain. The RNN will still make obvious mistakes, all while correctly applying subtle rules that trouble even humans. Unfortunately, that combination is exactly the kind of thing that leaves observers less than impressed: "Yes, I know it rambles insensibly, but look, it uses punctuation far better than your average forum dweller!" Alas, anyone who has gone to the trouble of making a Gouraud-shaded triangle spin in Mode X and proudly shown it to their childhood friends can explain just what sort of reaction to expect.
Eh, so, the moral here is to pay attention to cost effectiveness and not make things any more complicated than they need to be.
Yoav Goldberg covers much the same ground as this blog post, but with far more detail and attention to subtlety, here: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
The one characteristic a Markov chain must have is that the transition probabilities are completely determined by its current state. This property is true for both RNNs and what you call Markov chains. The main difference is that the state space for RNNs is a lot bigger and better at describing the current context (it was designed to be).
So in this sense, an RNN is actually a kind of hidden Markov chain - one with more structure added to it. That structure might make an RNN better than an HMM, but it doesn't make it more general; it makes it more specific.
It's like saying, "aren't binary trees just linked lists?"
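To make that point concrete, here is a minimal sketch (plain numpy, with random made-up weights, not anything from Karpathy's code) of a vanilla RNN step: the distribution over the next symbol is a function of the current hidden state alone, so the model is "Markovian" over a huge continuous state space.

    import numpy as np

    # A vanilla RNN cell with made-up (random) weights, purely to illustrate
    # the Markov property: the distribution over the next symbol depends only
    # on the current hidden state h, not on anything earlier in the sequence.
    rng = np.random.default_rng(0)
    vocab_size, hidden_size = 5, 8
    W_xh = rng.normal(size=(hidden_size, vocab_size))
    W_hh = rng.normal(size=(hidden_size, hidden_size))
    W_hy = rng.normal(size=(vocab_size, hidden_size))

    def step(h, x_id):
        """One RNN step: the new state and the next-symbol distribution are
        functions of (current state, current input) alone."""
        x = np.zeros(vocab_size)
        x[x_id] = 1.0
        h_new = np.tanh(W_xh @ x + W_hh @ h)
        logits = W_hy @ h_new
        probs = np.exp(logits - logits.max())
        return h_new, probs / probs.sum()

    # Sampling a sequence: the "state" of this chain is the real-valued
    # vector h -- an enormous (continuous) state space, but still Markovian.
    h = np.zeros(hidden_size)
    x_id = 0
    for _ in range(10):
        h, p = step(h, x_id)
        x_id = rng.choice(vocab_size, p=p)
        print(x_id, end=" ")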
But the thing that (both in theory and in practice) distinguishes RNNs from simpler constructs with very finite amounts of state is precisely what happens with not-so-tiny amounts of output. RNNs can produce sizable syntactically valid chunks of languages with "nested" structure -- open and close tags in XML, parentheses in Lisp, curly braces in C, etc.; Markov chains can't. It seems reasonable to guess that RNNs will also be able to produce better-structured natural language text than Markov chains ever will.
Show us a page of fake Shakespeare from the RNN and a page from the Markov chain, and I don't think it will be so difficult to tell which is which.
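As a toy illustration of why (a sketch with a made-up training string, not the actual char-rnn experiment): an order-k character Markov chain only ever conditions on the last k characters, so any information about how many tags are currently open is forgotten as soon as it scrolls out of that window.

    import random
    from collections import defaultdict, Counter

    # Toy order-k character Markov chain, to make the "finite context" point
    # concrete: the next character is chosen by looking at only the last k
    # characters, so there is no way to remember how many tags or parens are
    # open once that information falls outside the window.
    def train(text, k=3):
        counts = defaultdict(Counter)
        for i in range(len(text) - k):
            counts[text[i:i + k]][text[i + k]] += 1
        return counts

    def generate(counts, seed, n=200, k=3):
        out = list(seed)
        for _ in range(n):
            ctx = "".join(out[-k:])
            nxt = counts.get(ctx)
            if not nxt:
                break
            chars, weights = zip(*nxt.items())
            out.append(random.choices(chars, weights)[0])
        return "".join(out)

    # Hypothetical training text with nested structure.
    xml = "<a><b>hello</b><b>world</b></a>" * 50
    model = train(xml, k=3)
    print(generate(model, "<a>", n=120, k=3))
    # Locally the output looks like XML, but nothing forces the opening and
    # closing tags to balance over long spans.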
In my experience (having used both nets and Markov processes in computational creativity), nets hold more promise for higher-dimensional problems (e.g., 2D image generation).
I'm all for having more toys in the toybox though, so I hope work continues on both/all fronts.
Recurrent NNs are a different beast: they predict sequential processes, and so operate in territory closer to hidden Markov models. They have had some successes, but their operation and direction don't seem as well understood as convnets'.
Note that since the game of Go is a sequence of moves, one might train an RNN to play Go. However, AlphaGo (very roughly) used convnets to tell a Monte Carlo tree search which positions looked good.
I'm going partly from: http://neuralnetworksanddeeplearning.com/chap6.html#other_ap...
It also depends on how the HMMs are trained. Are they trained by reducing the loss over each document, or are they trained by just taking the frequency of all the needed transitions?
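For what "training by frequency" usually amounts to, here's a rough sketch (toy data, hypothetical function name): each transition probability is just a normalized count, estimated per state with no sequence-level objective being reduced.

    from collections import defaultdict, Counter

    # A sketch of "training by frequency": transition probabilities are just
    # normalized counts of observed (state -> next state) pairs, estimated
    # independently for each state. No document-level loss is involved; each
    # row of the transition table is fit on its own.
    def estimate_transitions(sequences):
        counts = defaultdict(Counter)
        for seq in sequences:
            for cur, nxt in zip(seq, seq[1:]):
                counts[cur][nxt] += 1
        return {
            cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()
        }

    # Hypothetical toy data: each "document" is a sequence of states.
    docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    print(estimate_transitions(docs)["the"])  # {'cat': 0.66..., 'dog': 0.33...}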
Training with a joint loss will be much more effective for this problem; training conditional random fields and then sampling from them will also work extremely well for this problem, and it allows arbitrary features for each letter.
RNNs do have great representational power, but the lack of joint training makes them as linear as CNNs. That representational power saves the day, but it will never really capture long-distance dependencies.
Both approaches have the so-called label bias problem (if the HMM is trained by frequency counts).
CRFs wouldn't have that problem, and RNNs somehow seem to avoid it through their great representational power, although the way they are trained suggests that label bias should exist there too.