Right, we’ve now gotten to the stage of this AI cycle where we start using the new tool to solve problems old tools could already solve. Saying a transformer can solve any formally decidable problem if given enough tape isn’t saying much. It’s a cool proof, don’t mean to deny that, but it doesn’t mean much practically, since we already have more efficient tools that can do the same.
What I don't get is... didn't people prove that in the 90s for any multi-layer neural network? And didn't people prove transformers are equivalent in the transformers paper?
Yes, they did. A two-layer network with enough units in the hidden layer can approximate any mapping to any desired accuracy; that's the universal approximation theorem.
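A minimal sketch of that claim, for anyone who wants to watch it happen: random tanh features with a least-squares output layer. This isn't the theorem's actual construction, and the target function and widths are arbitrary picks; the point is just that the error keeps falling as the hidden layer widens.

    import numpy as np

    # Two-layer net: random tanh hidden layer, linear output layer solved
    # exactly by least squares. Widening the hidden layer keeps improving
    # the fit to an arbitrary target mapping.
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 400).reshape(-1, 1)
    y = np.sin(2 * x) + 0.5 * np.cos(5 * x)        # arbitrary target

    for hidden in (4, 16, 64, 256):
        W1 = rng.normal(0, 2, (1, hidden))         # random input weights
        b1 = rng.uniform(-3, 3, hidden)            # random biases
        H = np.tanh(x @ W1 + b1)                   # hidden activations
        w2, *_ = np.linalg.lstsq(H, y, rcond=None) # fit output layer
        print(hidden, "units -> max error:", np.abs(H @ w2 - y).max())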
And a two-layer network with single-delay feedback from the hidden units to themselves can capture any dynamic behavior (again, to any desired accuracy).
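Same trick for the recurrent claim, in echo-state flavor (again not the original construction: the self-feedback weights here are fixed, random, and scaled for stability, only the linear readout is fitted, and the signal and sizes are made up):

    import numpy as np

    # Hidden units with single-delay feedback to themselves, plus a
    # least-squares readout, learn next-step prediction of a signal.
    rng = np.random.default_rng(1)
    hidden = 200
    Wx = rng.normal(0, 0.5, hidden)                # input -> hidden
    Wh = rng.normal(0, 1.0, (hidden, hidden))      # hidden -> hidden loop
    Wh *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wh)))  # keep loop stable

    t = np.arange(1000)
    u = np.sin(0.2 * t) * np.cos(0.031 * t)        # some dynamic signal
    h, states = np.zeros(hidden), []
    for i in range(len(u) - 1):
        h = np.tanh(u[i] * Wx + h @ Wh)            # update uses previous h
        states.append(h)
    H = np.array(states[100:])                     # drop the transient
    target = u[101:]                               # predict the next step
    w, *_ = np.linalg.lstsq(H, target, rcond=None)
    print("max next-step error:", np.abs(H @ w - target).max())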
Adding layers and more structured architectures creates the opportunity for more efficient training and inference, but doesn't enable any new potential behavior. (Except in the sense that reducing resource requirements can allow impractical problems to become practical.)
Putting a $50 bet that some very smart kid in the near future will come up with some entropy-meets-graphical-structures theorem that estimates how information loss is affected by the size and type of the underlying structure holding that information.
It took a while for people to actually start talking about LZW as a grammar algorithm rather than a "dictionary"-based one. The same idea is then treated in a more general way by https://en.wikipedia.org/wiki/Sequitur_algorithm.
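A toy way to see that reading: every LZW dictionary entry is "some earlier entry plus one symbol", which is literally a production rule. Quick sketch (string input only, names invented):

    def lzw_as_grammar(s):
        # Classic LZW over characters: each new table entry extends an
        # existing entry by one symbol. Read the entries as productions
        # and you get the implicit grammar LZW builds while compressing.
        table = {ch: i for i, ch in enumerate(sorted(set(s)))}
        codes, rules, w = [], [], s[0]
        for ch in s[1:]:
            if w + ch in table:
                w += ch
            else:
                codes.append(table[w])
                rules.append((len(table), table[w], ch))  # R_new -> R_old ch
                table[w + ch] = len(table)
                w = ch
        codes.append(table[w])
        return codes, rules

    codes, rules = lzw_as_grammar("abababab")
    print("codes:", codes)
    for new, old, ch in rules:
        print(f"R{new} -> R{old} {ch!r}")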
This is not to say that LLMs are not cool; we put them to use every day. But the reasoning part is never going to be trustworthy without a 100% discrete system, one that can infer the syllogistic chain with zero doubt and a 100% traceable origin.
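To make "traceable" concrete, a toy version of what that discrete layer could look like: a forward-chainer where every derived fact records the rule and premises that produced it, so any conclusion audits back to axioms. (Facts and rules here are invented for illustration.)

    # Toy forward-chaining prover with provenance: each derived fact
    # stores which rule and premises produced it, so the full syllogistic
    # chain behind any conclusion can be printed and checked.
    facts = {"man(socrates)": "axiom", "all men are mortal": "axiom"}
    rules = [
        ("syllogism", ["man(socrates)", "all men are mortal"],
         "mortal(socrates)"),
    ]

    changed = True
    while changed:
        changed = False
        for name, premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts[conclusion] = (name, premises)  # record provenance
                changed = True

    def trace(fact, depth=0):
        origin = facts[fact]
        if origin == "axiom":
            print("  " * depth + fact + "   [axiom]")
        else:
            name, premises = origin
            print("  " * depth + fact + f"   [by {name}]")
            for p in premises:
                trace(p, depth + 1)

    trace("mortal(socrates)")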