I can't find references to HMM-based large language models. Small HMM language models generate gibberish very similar to this.
An HMM consists of a state space, a state transition matrix, and an output (emission) probability matrix. A token space of 50k and a state space of something like 60k would have seemed impossible 10-20 years ago. It has only recently become viable.
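For concreteness, here is a toy NumPy sketch of those three pieces, with the sizes shrunk way down (everything here is illustrative; the names and sizes are made up, not a recipe for a 60k-state model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_tokens = 4, 10   # toy sizes, not the 60k / 50k discussed above

pi = np.full(n_states, 1.0 / n_states)                               # initial state distribution
A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)   # state transition matrix
B = rng.random((n_states, n_tokens)); B /= B.sum(1, keepdims=True)   # output (emission) probabilities

# "Generation" the HMM way: walk the hidden states, emit one token id per step.
state = rng.choice(n_states, p=pi)
tokens = []
for _ in range(20):
    tokens.append(int(rng.choice(n_tokens, p=B[state])))
    state = rng.choice(n_states, p=A[state])
print(tokens)
```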
Training one with Baum-Welch on a big enough text dataset would be interesting. It should be much faster than back-propagation on a transformer model.
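For anyone curious, a minimal dense Baum-Welch (EM) sketch in NumPy, again at toy sizes (a 60k-state model would need sparse or chunked variants and you would stream the corpus instead of holding one lattice; this is only meant to show the shape of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_tokens = 8, 20
obs = rng.integers(0, n_tokens, size=500)   # stand-in for a tokenized corpus

# Random row-stochastic initial parameters.
A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)   # transitions
B = rng.random((n_states, n_tokens)); B /= B.sum(1, keepdims=True)   # emissions
pi = np.full(n_states, 1.0 / n_states)                               # initial state

T = len(obs)
for _ in range(20):                          # EM iterations
    # E-step: scaled forward-backward pass.
    alpha = np.zeros((T, n_states)); beta = np.zeros((T, n_states)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(1, keepdims=True)     # per-step state posteriors
    xi = np.zeros((n_states, n_states))      # expected transition counts
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    # M-step: re-estimate parameters from expected counts.
    pi = gamma[0]
    A = xi / xi.sum(1, keepdims=True)
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, obs[t]] += gamma[t]
    B = B_new / B_new.sum(1, keepdims=True)

print("final log-likelihood:", np.log(scale).sum())
```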
It's okayish. Considering that 64 GB to 128 GB of RAM are available to (nerd) high-end consumers, you're only off by a factor of 5 (if we can squeeze out a little bit more performance).
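Back-of-the-envelope, with the sizes from the parent comment (the dtype is my assumption; Baum-Welch training also needs expected-count accumulators of the same shape plus forward/backward lattices, so it wants several times the raw parameter storage):

```python
# Rough parameter-storage estimate for a dense HMM with the sizes above.
n_states, n_tokens = 60_000, 50_000
bytes_per_param = 4  # float32 assumed; float64 doubles this

transition = n_states * n_states * bytes_per_param
emission = n_states * n_tokens * bytes_per_param
print(f"transition matrix: {transition / 2**30:.1f} GiB")   # ~13.4 GiB
print(f"emission matrix:   {emission / 2**30:.1f} GiB")     # ~11.2 GiB
print(f"parameters only:   {(transition + emission) / 2**30:.1f} GiB")
```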
Not exactly every few words in my experience; I would say more like every 100 words, if you make your Markov chain more sophisticated (n-gram order of 3 at minimum, a good tokenizer tailored to the training data, a large training set (500 KB or more), an intelligent fallback instead of a random token, etc.).
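Roughly the kind of thing I mean, sketched with toy stand-ins (a naive whitespace split in place of a good tokenizer, a tiny repeated string in place of a 500 KB+ corpus; the backoff-to-shorter-context fallback is the part that matters):

```python
import random
from collections import Counter, defaultdict

ORDER = 3  # n-gram order: predict the next token from up to 3 previous tokens

def train(tokens, order=ORDER):
    # One count table per context length, from unigram (length 0) up to `order`.
    tables = [defaultdict(Counter) for _ in range(order + 1)]
    for i in range(len(tokens)):
        for n in range(order + 1):
            if i >= n:
                tables[n][tuple(tokens[i - n:i])][tokens[i]] += 1
    return tables

def sample_next(tables, history, order=ORDER):
    # "Intelligent fallback": back off to shorter contexts before giving up.
    for n in range(order, -1, -1):
        counts = tables[n].get(tuple(history[-n:]) if n else ())
        if counts:
            toks, weights = zip(*counts.items())
            return random.choices(toks, weights=weights)[0]
    return None  # only reachable with an empty training set

text = "the cat sat on the mat and the dog sat on the rug " * 50  # toy stand-in
tokens = text.split()  # naive whitespace tokenizer standing in for a real one

tables = train(tokens)
out = tokens[:ORDER]
for _ in range(40):
    out.append(sample_next(tables, out))
print(" ".join(out))
```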
Neural models passed the point where Markov chains made any sense a long time ago, by many orders of magnitude.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
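As a structural sketch only (it shows the shape of that computation, not the universal-approximation claim; the sizes are arbitrary):

```python
# A linear map followed by a non-linearity, applied recurrently
# (an Elman-style RNN step), with a second linear layer reading the state out.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 16, 64, 16

W_in = rng.normal(0, 0.1, (d_hidden, d_in))       # input -> hidden (linear)
W_rec = rng.normal(0, 0.1, (d_hidden, d_hidden))  # hidden -> hidden (recurrence)
W_out = rng.normal(0, 0.1, (d_out, d_hidden))     # hidden -> output (second layer)

def step(x, h):
    h = np.tanh(W_in @ x + W_rec @ h)   # linear map + non-linear function
    y = W_out @ h
    return y, h

h = np.zeros(d_hidden)
for x in rng.normal(size=(10, d_in)):   # run over a short input sequence
    y, h = step(x, h)
print(y.shape, h.shape)
```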
All other neural architectures are then simply sparser arrangements that bring compute demands down, with the sparseness fitted to the type of problem.
The sparseness can take the form of deeper but narrower information flows (hence "deep" learning), or of fewer weights per weight application (i.e. shared weights, as in convolutions).
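A rough illustration of the weight-sharing point, with made-up sizes:

```python
# Same-sized input and output, once as a dense layer and once as a
# shared-weight 1D convolution (single channel, kernel size 3).
width, kernel = 1024, 3

dense_params = width * width   # every output position gets its own weight for every input
conv_params = kernel           # one small kernel, reused at every position

print(f"dense layer:    {dense_params:,} weights")
print(f"1D convolution: {conv_params:,} weights (applied at {width} positions)")
```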