
In this case, it does, because the vocab is not a list of words but a list of tokens. Each token may be a whole word, but it can also be a phrase or a fragment of a word. The tokens are chosen to be optimal on the training data, i.e. for a given vocab size, to minimize the number of tokens needed to represent it.
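
For intuition, here is a minimal sketch of BPE-style vocab building, where the most frequent adjacent pair of symbols is repeatedly merged into a new token (the corpus and merge count are made up for illustration):

    from collections import Counter

    def learn_bpe(corpus, num_merges):
        # Start from characters; each word is a tuple of symbols.
        words = Counter(tuple(w) for w in corpus.split())
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs, weighted by word frequency.
            pairs = Counter()
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            # Replace the best pair with a single merged symbol everywhere.
            merged = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged[tuple(out)] += freq
            words = merged
        return merges

    # Frequent sequences get merged into single tokens first:
    print(learn_bpe("the cat sat on the mat the cat", 5))

Because merges are driven by frequency, strings that appear often in the training data end up as single tokens, which is what ties the vocab to the data distribution.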

Therefore, how the vocab is distributed gives a good guide to how the data is distributed: if there were 10x more English-language data than Russian, the optimal allocation would dedicate correspondingly more of the token space to English than to Russian.
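
As a rough illustration, you could estimate the language mix by bucketing a vocab file by script (the file name "vocab.txt" and the first-alphabetic-character heuristic are assumptions for the sketch):

    import unicodedata
    from collections import Counter

    def script_of(token):
        # Crude heuristic: classify by the first alphabetic character.
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if name.startswith("CYRILLIC"):
                    return "Cyrillic"
                if name.startswith("LATIN"):
                    return "Latin"
                return "Other"
        return "Symbol"

    with open("vocab.txt", encoding="utf-8") as f:
        counts = Counter(script_of(line.strip()) for line in f)
    print(counts)  # a large Cyrillic share would suggest a lot of Russian data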



