But how do you build a Markov chain that does what an LLM does?
I've actually thought about this from time to time and come up with ideas like 'skip' states that just don't change the context on meaningless words, maintaining distant states in some kind of context alongside the Markov chain, multiple contexts in multiple directions mixed with a neural network at the end (as in the 2D image example), and so on. But the big issue is always building the Markov state network itself.
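To make the 'skip' state idea concrete, here's a minimal sketch, entirely my own hypothetical construction: a word-level chain where a stopword list (`SKIP`, an assumed stand-in for "meaningless words") never becomes a context state, so the context only advances on content words.

```python
import random
from collections import defaultdict

# Hypothetical stopword list — these are the 'meaningless words'
# that pass through without changing the context.
SKIP = {"the", "a", "an", "of", "and"}

def build_skip_chain(words):
    """Word-level chain where skip words never become context states.

    Every word is recorded as a possible successor of the current
    context, but only content words update the context itself.
    """
    chain = defaultdict(list)
    context = None
    for w in words:
        if context is not None:
            chain[context].append(w)
        if w not in SKIP:  # skip words leave the context untouched
            context = w
    return chain

def step(chain, context):
    """Sample one successor word from the current content-word context."""
    return random.choice(chain[context])

words = "the cat chased the dog and the dog ran".split()
chain = build_skip_chain(words)
print(chain["dog"])  # successors recorded while 'dog' held the context
```

The point of the sketch is that "dog" stays the active state across "and" and "the", so distant content words get linked directly.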
The attention mechanism and the multi-layer neural network that follows it are honestly extremely inelegant: basically an entire datacenter sledgehammer of contexts and back-propagated neural networks to make something that works, while Markov chains would easily run on an Apple II if you got them right (linearly navigating a chain of states is very fast). But the hard part is making Markov chains that can represent language well. The best we've done is very simple linear next-letter predictors.
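For reference, that "very simple linear next-letter predictor" fits in a few lines. This is a standard order-n character Markov chain, not anything from the discussion above; the corpus and seed are just placeholders.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Count next-character frequencies for each length-`order` context."""
    chain = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - order):
        context = text[i:i + order]
        chain[context][text[i + order]] += 1
    return chain

def generate(chain, seed, length=40):
    """Walk the chain: repeatedly sample the next character given
    the last `order` characters, stopping at an unseen context."""
    order = len(seed)
    out = seed
    for _ in range(length):
        counts = chain.get(out[-order:])
        if not counts:
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

text = "the cat sat on the mat and the cat ran"
chain = build_chain(text, order=2)
print(generate(chain, "th"))
```

Generation really is just table lookups and a weighted coin flip per character, which is why this class of model is so cheap, and also why it has no long-range memory at all.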
I do honestly think there may be some value in stealing word weights from an attention mechanism and making Markov chains out of each of them: each word navigates a Markov chain of nearby words to end up at a new state. You'd still mix all the contexts at the end with a neural network, but you'd skip the computationally expensive attention mechanism. Building those Markov chains sensibly is still a hard problem, though. DeepSeek has shown us there are huge opportunities in optimizing the attention mechanism, and Markov chains are computationally very simple, but 'how do you build good Markov chains' is a very hard problem.
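A rough sketch of what "attention weights as transition tables" might look like, with heavy caveats: the `attn` matrix here is a random stand-in for weights you'd actually lift from a trained transformer layer, and the final mixing network is omitted entirely. Each token's attention row is treated as its transition distribution, and each token walks its own chain for a few steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one layer's attention weights: rows sum
# to 1, and entry [i, j] is how much token i attends to token j.
tokens = ["the", "cat", "sat", "on", "mat"]
attn = rng.random((len(tokens), len(tokens)))
attn /= attn.sum(axis=1, keepdims=True)

def walk(attn, start, steps, rng):
    """Treat each attention row as Markov transition probabilities
    and take `steps` random hops from the starting token."""
    state = start
    for _ in range(steps):
        state = rng.choice(attn.shape[0], p=attn[state])
    return state

# Each token walks its own chain; the resulting final states are what
# you'd hand to a small mixing network instead of running attention.
finals = [walk(attn, i, steps=3, rng=rng) for i in range(len(tokens))]
print([tokens[f] for f in finals])
```

Walking a chain like this is a handful of table lookups per token, versus the quadratic matrix products of attention, which is the whole appeal; the unsolved part is getting transition tables that actually encode language rather than random noise like these.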