
In this case, it does, because the vocab is not a list of words but a list of tokens. Each token may be a whole word, but it can also be a phrase or a fragment of a word. The tokens are chosen to be optimal on the training data, i.e. for a given vocab size, to minimize the number of tokens needed to represent it.
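
For intuition, here is a minimal sketch of BPE-style vocab building, where the most frequent adjacent pair of symbols is repeatedly merged into a new token (the corpus and merge count are made up for illustration):

    from collections import Counter

    def learn_bpe(corpus, num_merges):
        # Start from characters; each word is a tuple of symbols.
        words = Counter(tuple(w) for w in corpus.split())
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs, weighted by word frequency.
            pairs = Counter()
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            # Replace the best pair with a single merged symbol everywhere.
            merged = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged[tuple(out)] += freq
            words = merged
        return merges

    # Frequent sequences get merged into single tokens first:
    print(learn_bpe("the cat sat on the mat the cat", 5))

Because merges are driven by frequency, strings that appear often in the training data end up as single tokens, which is what ties the vocab to the data distribution.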

Therefore, how the vocab is distributed gives a good guide to how the data is distributed: if there were 10x more English-language data than Russian, the optimal allocation would dedicate correspondingly more of the token space to English than to Russian.
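
As a rough illustration, you could estimate the language mix by bucketing a vocab file by script (the file name "vocab.txt" and the first-alphabetic-character heuristic are assumptions for the sketch):

    import unicodedata
    from collections import Counter

    def script_of(token):
        # Crude heuristic: classify by the first alphabetic character.
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if name.startswith("CYRILLIC"):
                    return "Cyrillic"
                if name.startswith("LATIN"):
                    return "Latin"
                return "Other"
        return "Symbol"

    with open("vocab.txt", encoding="utf-8") as f:
        counts = Counter(script_of(line.strip()) for line in f)
    print(counts)  # a large Cyrillic share would suggest a lot of Russian data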



