Training LLMs over Neurally Compressed Text (arxiv.org)
10 points by wseqyrku 4 months ago | 3 comments



Can someone explain why this is better than using a larger tokenizer? To me it seems like this would just make it harder for the LLM to understand the content (when a token can carry multiple meanings and doesn't correspond to a complete unit, it can't have a good embedding).


A token already has multiple meanings, because words (and parts of words) can have multiple meanings.


Sure, current tokenizers have some of the same problem. However, I think this would take it from "some tokens aren't words" to "(almost) all tokens aren't words". Correct me if I'm missing something, though.
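
To make that contrast concrete, here's a toy sketch (my own illustration, not the paper's setup: zlib stands in for the paper's small-LM arithmetic coder, and fixed-size byte chunks stand in for the compressed "tokens"):

    import zlib

    text = "the cat sat on the mat because the mat was warm"

    # A word-level tokenizer: most tokens are whole words with fairly stable meanings.
    word_tokens = text.split()
    print(word_tokens)

    # Compress the text, then cut the compressed stream into 2-byte "tokens".
    compressed = zlib.compress(text.encode("utf-8"))
    chunk_tokens = [compressed[i:i + 2] for i in range(0, len(compressed), 2)]
    print(chunk_tokens)  # opaque byte pairs; none align with word boundaries

The compressed chunks carry more bits each, but none of them lines up with a word or even a stable fragment, which is the "(almost) all tokens aren't words" situation described above.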



