
Ah, so the model "sees" the tags as literal ASCII characters interspersed with special tokens? That would make more sense.





More or less. They’re not literally the same tokens as “a”, “b”, “c”, but I’d speculate the mapping is learned from other examples of ASCII (or just Roman letters) being repeated in obscure parts of Unicode: Gothic glyphs, bubble letters, etc. Once the model has seen enough ASCII represented as Unicode code points whose tokenizations alternate between meaningless and meaningful pieces (e.g. “~l~i~k~e~ ~t~h~i~s”), it learns how to read it regardless of what the “~” is.
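To make the speculation above concrete, here is a rough Python sketch. It assumes the "tags" in question are the Unicode Tags block (U+E0000 to U+E007F), which mirrors ASCII at a fixed offset. The helper names (to_tags, from_tags, interleave, utf8_hex) are made up for illustration. It only shows the byte-level pattern; whether any particular tokenizer splits the bytes this way is not something this thread establishes.

```python
# Assumption: "tags" means the Unicode Tags block, which mirrors ASCII at +0xE0000.
TAG_OFFSET = 0xE0000


def to_tags(text: str) -> str:
    """Shift each ASCII character into the Tags block (non-ASCII is dropped)."""
    return "".join(chr(TAG_OFFSET + ord(c)) for c in text if ord(c) < 0x80)


def from_tags(text: str) -> str:
    """Recover ASCII from tag characters, ignoring anything outside the block."""
    return "".join(
        chr(ord(c) - TAG_OFFSET)
        for c in text
        if TAG_OFFSET <= ord(c) <= TAG_OFFSET + 0x7F
    )


def interleave(text: str, filler: str = "~") -> str:
    """Put a filler before every character, as in the ~l~i~k~e~ example."""
    return "".join(filler + c for c in text)


def utf8_hex(text: str) -> list[str]:
    """UTF-8 bytes of each character, as hex strings."""
    return [c.encode("utf-8").hex() for c in text]


if __name__ == "__main__":
    print(interleave("like this"))   # ~l~i~k~e~ ~t~h~i~s
    print(utf8_hex(to_tags("abc")))  # ['f3a081a1', 'f3a081a2', 'f3a081a3']
    print(from_tags(to_tags("like this")))
```

For these letters the first three UTF-8 bytes are constant (f3 a0 81) and only the last byte changes with the letter, so at the byte level the encoded text has the same "filler, signal, filler, signal" shape as the “~” example.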


