> like the full texts of the subreddit dedicated to counting to a million
This was the source of the "anomalous tokens" phenomenon, where the usernames of prolific counters were yielding weird and unexpected behavior in the OpenAI models.
While definitely an interesting scientific curiosity, is there a reason you'd actually want this in a production model?
This is not entirely correct, from what I understand. The source of the "anomalous tokens" phenomenon is that those texts were included when training the tokenizer but not when training the models themselves. It is not clear they would necessarily induce the same effect otherwise (i.e., if both the tokenizer and the LLMs had been trained on those "counting" texts).
EDIT: notice that the "tokens" that trigger the "glitch" are not the numbers themselves but the usernames of the people counting on that subreddit (which appear nowhere in the training dataset, due to a cleaning step that removed the "counting" texts)
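For illustration, here is a minimal sketch (assuming the `tiktoken` package is installed) showing that one of those usernames still exists as a single entry in the GPT-2-era vocabulary, which is what made the mismatch possible: the tokenizer reserved a token for a string the model essentially never saw during training.

```python
# Minimal sketch: check that a known "glitch" username is a single token
# in the GPT-2-era vocabulary (r50k_base). Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

# " SolidGoldMagikarp" (with a leading space) is one of the well-known
# anomalous tokens -- the username of a prolific r/counting participant.
ids = enc.encode(" SolidGoldMagikarp")
print(ids)            # a single token id, despite the long string
print(enc.decode(ids))  # round-trips back to the username
```

If the string comes back as one token id, the tokenizer learned it as a unit; the glitch arises because the model's embedding for that id was never meaningfully updated, not because of anything about the counting numbers themselves.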