Yeah, that's my understanding of the root cause. Tokenization can also cause weirdness with numbers because they aren't tokenized one digit at a time. That's done for good reason, but it still causes some unexpected issues.
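To make that concrete, here's a minimal sketch (assuming the third-party tiktoken library is available; the GPT-2 vocabulary is used purely for illustration) showing how a plain BPE vocabulary chops numbers into uneven multi-digit chunks:

    # Sketch: inspect how a BPE vocabulary splits numbers.
    # Assumes `tiktoken` is installed; exact splits vary by vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for n in ["7", "1234", "1234567"]:
        pieces = [enc.decode([tok]) for tok in enc.encode(n)]
        print(f"{n!r} -> {pieces}")

    # Numbers come out as uneven multi-digit chunks rather than single
    # digits, so nearby values can have very different decompositions.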



I believe DeepSeek models do split numbers into individual digits, and that provides a large boost to arithmetic ability. I would hope it's the standard now.
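As a sketch of the general technique (not DeepSeek's actual tokenizer code), digit-level splitting is just a pre-tokenization pass that isolates each digit before BPE runs:

    # Sketch of digit-level pre-tokenization: each digit becomes its own
    # piece before BPE, so numbers always appear as consistent digit
    # sequences. Illustrative only, not any model's real implementation.
    import re

    def pretokenize(text: str) -> list[str]:
        return re.findall(r"\d|[^\d\s]+|\s+", text)

    print(pretokenize("Add 1234 and 567"))
    # ['Add', ' ', '1', '2', '3', '4', ' ', 'and', ' ', '5', '6', '7']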


Could be the case; I'm not familiar with their specific tokenizers. IIRC llama 3 tokenizes numbers in chunks of up to three digits. That seems better than arbitrarily sized chunks with BPE, but it's still kind of odd: the embedding layer has to learn the semantics of roughly a thousand distinct number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs. 1.
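For illustration, that chunking behavior can be approximated with a greedy left-to-right regex over runs of up to three digits (an approximation for demonstration purposes, not the actual tokenizer):

    # Approximation of llama-3-style number pre-tokenization: greedy
    # left-to-right runs of 1-3 digits. Note the left-aligned splits and
    # the '001' vs '1' overlap described above.
    import re

    NUM = re.compile(r"\d{1,3}")

    for n in ["1", "001", "1000", "1234567"]:
        print(f"{n!r} -> {NUM.findall(n)}")

    # '1'       -> ['1']
    # '001'     -> ['001']
    # '1000'    -> ['100', '0']
    # '1234567' -> ['123', '456', '7']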



