It's just so dissonant to me that the tokens in mathematics are the digits, not bundles of digits. The idea of tokenization makes sense for taking the power away from individual letters, since it provides language agnosticism.
But for maths, it doesn't seem appropriate.
I wonder what the effect of forcing tokenization into separate digits would be.
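For reference, you can see how an existing BPE tokenizer bundles digits with something like this (a quick sketch using the tiktoken library and its cl100k_base encoding; the exact splits and ids depend on the encoding, so I'm not asserting specific outputs):

    # Requires: pip install tiktoken
    # Prints how many tokens each number string becomes and what the pieces look like.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["7", "1331", "4796", "1000000"]:
        ids = enc.encode(s)
        print(s, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])

Forcing digit-level tokenization would just mean every one of those strings comes out as one token per digit instead of whatever bundles the BPE merges happen to produce.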
This reminds me of the riddle about someone buying the numerals to put their address on their house. When you are looking at text, the point is that all you have are the characters/symbols/tokens/whatever you want to call them. You can't really shepherd some over to their numeric value while leaving others at their token value, unless you want to cause other issues when it comes time to reason about them later.
I'd hazard that the majority of numbers in most text are not such that they should be converted to a number, per se. Consider addresses, postal codes, phone numbers, ... ok, I may have run out of things to consider. :D
Perhaps I'm just missing fundamentals on tokenization.
But I fail to see how forcing tokenization at the digit level for numbers would somehow impact non-numerical meanings of digits. The same characters always map to the same tokens through a simple mapping, right? It's not like context and meaning change tokenization.
That is:
"my credit card ends in 4796 and my address is N street 1331"
parses to the same tokens as:
"Multiply 4796 by 1331"
So by tokenizing digits we don't introduce the problem of tokens having different meanings depending on context.
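Here's a toy sketch of what I mean (the token ids are made up, this isn't any real model's tokenizer): every digit character always maps to the same id, so the two sentences above yield identical token sequences for their digits.

    # Toy digit-level tokenizer: one fixed id per digit, context never matters.
    DIGIT_TOKENS = {str(d): 1000 + d for d in range(10)}  # hypothetical token ids

    def tokenize_digits(text: str) -> list[int]:
        """Return token ids for the digit characters only, in order of appearance."""
        return [DIGIT_TOKENS[ch] for ch in text if ch.isdigit()]

    a = tokenize_digits("my credit card ends in 4796 and my address is N street 1331")
    b = tokenize_digits("Multiply 4796 by 1331")
    assert a == b  # same digits, same tokens, regardless of the surrounding context
    print(a)       # [1004, 1007, 1009, 1006, 1001, 1003, 1003, 1001]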
I think I see your point, but how would you want to include localized numbers, such as 1,024, in a stream? Would you assume all 0x123 numbers are hex, as that is a common norm? Does the tokenizer already know how to read scientific notation, 1e2 for example?
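For concreteness, a naive digit-splitting pre-tokenizer (just a sketch, not how any production tokenizer works) would carve those formats into individual characters and leave the model to figure out what the separators, prefixes, and exponents mean:

    import re

    # Each digit becomes its own piece; letters and punctuation are split
    # with an ordinary word-boundary pattern. Purely illustrative.
    def pre_tokenize(text: str) -> list[str]:
        return re.findall(r"\d|[A-Za-z]+|[^\w\s]", text)

    for sample in ["1,024", "0x123", "1e2", "3.14e-5"]:
        print(sample, "->", pre_tokenize(sample))
    # 1,024   -> ['1', ',', '0', '2', '4']
    # 0x123   -> ['0', 'x', '1', '2', '3']
    # 1e2     -> ['1', 'e', '2']
    # 3.14e-5 -> ['3', '.', '1', '4', 'e', '-', '5']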
That is all to say that numbers in text are already surprisingly flexible. The point of letting the tokenizer pick the tokens is to let the model learn that flexibility. It is the same reason we don't tokenize at the word level, or try to apply a soundex normalization. All of these are probably worth at least trying, and may even do better in some contexts. The general framework has a reason to be, though.
I think that as long as the attention mechanism has been trained on each possible numerical token enough, this is true. But if a particular token is underrepresented, it could potentially cause inaccuracies.