I’m the one who will fight you on this, including with peer-reviewed papers indicating that it is in fact due to tokenization. I’m too tired right now, so take this as my bookmark to remind me to come back and respond.
We know there are narrow solutions to these problems; the argument was never that the specific narrow task is impossible to solve.
The discussion is about general intelligence: the model fails at a task it is capable of simply because it chooses the wrong strategy, and that is a problem of poor generalization, not a problem of tokenization. Being able to choose the right strategy is core to general intelligence. Altering the input data to make it easier for the model to find the right solution to specific questions does not make it more general; you just shift which narrow problems it is good at.
I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better when digits are tokenized right-to-left rather than left-to-right). But I am talking about counting, and specifically counting words, not characters. I don’t think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
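For what it's worth, here's a toy illustration of the right-to-left point (plain Python, nothing to do with any particular model's tokenizer): when the least significant digits come first, each output digit depends only on digits already produced, which mirrors autoregressive generation.

```python
# Toy sketch: add two numbers digit by digit, least significant digit first.
# The point is ordering, not modeling: with right-to-left digits, each emitted
# digit depends only on digits that came before it, like autoregressive decoding.
def add_right_to_left(a: str, b: str) -> str:
    a_rev, b_rev = a[::-1], b[::-1]
    out, carry = [], 0
    for i in range(max(len(a_rev), len(b_rev))):
        da = int(a_rev[i]) if i < len(a_rev) else 0
        db = int(b_rev[i]) if i < len(b_rev) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))            # digits are emitted in computation order
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))         # flip back only for display

print(add_right_to_left("987", "46"))     # -> 1033
```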
Words vs. characters is a similar problem, since a token can be less than one word, multiple words, multiple words plus a partial word, or a word fused with non-word punctuation like a sentence-ending period.
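You can see the misalignment directly if you have the tiktoken package handy (the exact splits depend on the encoding, so treat this as a sketch rather than a claim about any specific model):

```python
# Sketch, assuming the tiktoken package is available: print each token's exact
# byte span for a sentence. Tokens often cover partial words, carry leading
# spaces, or fuse a word with trailing punctuation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Counting words is harder when tokenization ignores word boundaries."
for tok_id in enc.encode(text):
    print(tok_id, enc.decode_single_token_bytes(tok_id))
```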
My intuition says that tokenization is a factor, especially if one model's tokenizer splits up individual move descriptions differently from other LLMs' tokenizers.
If you think about how our brains handle this input, we certainly don't split a move between the letter and the number, although I would think the letter and number appearing together would still trigger the same two tokens.
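A quick way to check that intuition, assuming tiktoken is installed, is to run the same move string through two different public encodings and compare the chunks (the specific splits are whatever those encodings happen to produce; I'm not claiming a particular result):

```python
# Sketch, assuming tiktoken is installed: tokenize the same move sequence
# with two different public encodings and compare how each one chunks it.
import tiktoken

moves = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    chunks = [enc.decode_single_token_bytes(t) for t in enc.encode(moves)]
    print(name, chunks)
```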
I strongly believe it's not that tokenization isn't the underlying problem; it's that, let's say, bit-by-bit tokenization is too expensive to run at the scale these things are currently being run at (OpenAI, Claude, etc.).
It's not just a current thing, either. Tokenization basically lets you have a model with a larger input context than you'd otherwise have under the given resource constraints. So any gains from feeding the characters in directly have to be greater than this advantage. And for CoT especially - which we know produces significant improvements on most tasks - you want a large context.
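Rough numbers to make the trade-off concrete, assuming an average of about 4 characters per BPE token for English (a common rule of thumb, not a property of any specific model) and the usual quadratic attention cost:

```python
# Back-of-the-envelope comparison of token-level vs. character-level input.
# ASSUMPTION: ~4 characters per BPE token on average for English text;
# attention cost grows roughly with the square of sequence length.
CHARS_PER_TOKEN = 4            # assumed average compression from BPE
token_budget = 8192            # hypothetical context window in tokens

char_level_length = token_budget * CHARS_PER_TOKEN      # same text, one char per position
cost_multiplier = (char_level_length / token_budget) ** 2

print(f"character-level sequence length: {char_level_length}")       # 32768
print(f"approx. attention cost multiplier: {cost_multiplier:.0f}x")  # 16x
```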