Yeah, that's my understanding of the root cause. Tokenization can also cause weirdness with numbers because they aren't tokenized one digit at a time. That's done for good reason, but it still causes some unexpected issues.
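To make that concrete, here's a minimal sketch (assuming the third-party tiktoken library is available; the GPT-2 vocabulary is used purely for illustration) showing how a plain BPE vocabulary chops numbers into uneven multi-digit chunks:

    # Sketch: inspect how a BPE vocabulary splits numbers.
    # Assumes `tiktoken` is installed; exact splits vary by vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for n in ["7", "1234", "1234567"]:
        pieces = [enc.decode([tok]) for tok in enc.encode(n)]
        print(f"{n!r} -> {pieces}")

    # Numbers come out as uneven multi-digit chunks rather than single
    # digits, so nearby values can have very different decompositions.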



I believe DeepSeek models do split numbers into individual digits, and that provides a large boost to arithmetic ability. I would hope it's the standard now.
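As a sketch of the general technique (not DeepSeek's actual tokenizer code), digit-level splitting is just a pre-tokenization pass that isolates each digit before BPE runs:

    # Sketch of digit-level pre-tokenization: each digit becomes its own
    # piece before BPE, so numbers always appear as consistent digit
    # sequences. Illustrative only, not any model's real implementation.
    import re

    def pretokenize(text: str) -> list[str]:
        return re.findall(r"\d|[^\d\s]+|\s+", text)

    print(pretokenize("Add 1234 and 567"))
    # ['Add', ' ', '1', '2', '3', '4', ' ', 'and', ' ', '5', '6', '7']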


Could be the case; I'm not familiar with their specific tokenizers. IIRC llama 3 tokenizes numbers in chunks of up to three digits. That seems better than arbitrarily sized chunks with BPE, but it's still kind of odd: the embedding layer has to learn the semantics of roughly a thousand distinct number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs. 1.
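For illustration, that chunking behavior can be approximated with a greedy left-to-right regex over runs of up to three digits (an approximation for demonstration purposes, not the actual tokenizer):

    # Approximation of llama-3-style number pre-tokenization: greedy
    # left-to-right runs of 1-3 digits. Note the left-aligned splits and
    # the '001' vs '1' overlap described above.
    import re

    NUM = re.compile(r"\d{1,3}")

    for n in ["1", "001", "1000", "1234567"]:
        print(f"{n!r} -> {NUM.findall(n)}")

    # '1'       -> ['1']
    # '001'     -> ['001']
    # '1000'    -> ['100', '0']
    # '1234567' -> ['123', '456', '7']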



