It's just so dissonant to me that the tokens in mathematics are the digits, not bundles of digits. The idea of tokenization makes sense for taking the power away from individual letters, since it provides language agnosticism.
But for maths, it doesn't seem appropriate.
I wonder what the effect of forcing tokenization into separate digits would be.
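For reference, you can see how an existing BPE tokenizer bundles digits with something like this (a quick sketch using the tiktoken library and its cl100k_base encoding; the exact splits and ids depend on the encoding, so I'm not asserting specific outputs):

    # Requires: pip install tiktoken
    # Prints how many tokens each number string becomes and what the pieces look like.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["7", "1331", "4796", "1000000"]:
        ids = enc.encode(s)
        print(s, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])

Forcing digit-level tokenization would just mean every one of those strings comes out as one token per digit instead of whatever bundles the BPE merges happen to produce.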
This reminds me of the riddle about someone buying the numerals to put their address on their house. When you are looking at text, the point is that all you have are the characters/symbols/tokens/whatever you want to call them. You can't really shepherd some over to their numeric value while leaving others at their token value, unless you want to cause other issues when it comes time to reason about them later.
I'd hazard that the majority of numbers in most text are not such that they should be converted to a number, per se. Consider addresses, postal codes, phone numbers, ... ok, I may have run out of things to consider. :D
Perhaps I'm just missing fundamentals on tokenization.
But I fail to see how forcing tokenization at the digit level for numbers would somehow impact non-numerical meanings of digits. The same characters always map to the same tokens through a simple mapping, right? It's not like context and meaning change tokenization.
That is:
"my credit card ends in 4796 and my address is N street 1331"
parses to the same tokens as:
"Multiply 4796 by 1331"
So by tokenizing digits we don't introduce the problem of tokens having different meanings depending on context.
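Here's a toy sketch of what I mean (the token ids are made up, this isn't any real model's tokenizer): every digit character always maps to the same id, so the two sentences above yield identical token sequences for their digits.

    # Toy digit-level tokenizer: one fixed id per digit, context never matters.
    DIGIT_TOKENS = {str(d): 1000 + d for d in range(10)}  # hypothetical token ids

    def tokenize_digits(text: str) -> list[int]:
        """Return token ids for the digit characters only, in order of appearance."""
        return [DIGIT_TOKENS[ch] for ch in text if ch.isdigit()]

    a = tokenize_digits("my credit card ends in 4796 and my address is N street 1331")
    b = tokenize_digits("Multiply 4796 by 1331")
    assert a == b  # same digits, same tokens, regardless of the surrounding context
    print(a)       # [1004, 1007, 1009, 1006, 1001, 1003, 1003, 1001]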
I think I see your point, but how would you want to include localized numbers, such as 1,024, in a stream? Would you assume all 0x123 numbers are hex, as that is a common norm? Does the tokenizer already know how to read scientific notation, 1e2 for example?
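For concreteness, a naive digit-splitting pre-tokenizer (just a sketch, not how any production tokenizer works) would carve those formats into individual characters and leave the model to figure out what the separators, prefixes, and exponents mean:

    import re

    # Each digit becomes its own piece; letters and punctuation are split
    # with an ordinary word-boundary pattern. Purely illustrative.
    def pre_tokenize(text: str) -> list[str]:
        return re.findall(r"\d|[A-Za-z]+|[^\w\s]", text)

    for sample in ["1,024", "0x123", "1e2", "3.14e-5"]:
        print(sample, "->", pre_tokenize(sample))
    # 1,024   -> ['1', ',', '0', '2', '4']
    # 0x123   -> ['0', 'x', '1', '2', '3']
    # 1e2     -> ['1', 'e', '2']
    # 3.14e-5 -> ['3', '.', '1', '4', 'e', '-', '5']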
That is all to say that numbers in text are already surprisingly flexible. The point of letting the tokenizer pick the tokens is to let the model learn that flexibility. It is the same reason we don't tokenize at the word level, or try to apply a soundex normalization. All of these are probably worth at least trying, and may even do better in some contexts. The general framework has a reason to be, though.
I think that as long as the attention mechanism has been trained on each possible numerical token enough, this is true. But if a particular token is underrepresented, it could potentially cause inaccuracies.