> like the full texts of the subreddit dedicated to counting to a million
This was the source of the "anomalous tokens" phenomenon, where the usernames of prolific counters were yielding weird and unexpected behavior in the OpenAI models.
While definitely an interesting scientific curiosity, is there a reason you'd actually want this in a production model?
This is not entirely correct, from what I understand. The source of the "anomalous tokens" phenomenon is that those texts were included when training the tokenizer but not when training the models themselves. It is not clear they would necessarily induce the same effect otherwise (i.e., if both the tokenizer and the LLMs had been trained on those "counting" texts).
EDIT: notice that the "tokens" that trigger the "glitch" are not the numbers themselves but the usernames of the people counting on that subreddit (which appear nowhere in the training dataset, due to a cleaning step that removed the "counting" texts)
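For illustration, here is a minimal sketch (assuming the `tiktoken` package is installed) showing that one of those usernames still exists as a single entry in the GPT-2-era vocabulary, which is what made the mismatch possible: the tokenizer reserved a token for a string the model essentially never saw during training.

```python
# Minimal sketch: check that a known "glitch" username is a single token
# in the GPT-2-era vocabulary (r50k_base). Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

# " SolidGoldMagikarp" (with a leading space) is one of the well-known
# anomalous tokens -- the username of a prolific r/counting participant.
ids = enc.encode(" SolidGoldMagikarp")
print(ids)            # a single token id, despite the long string
print(enc.decode(ids))  # round-trips back to the username
```

If the string comes back as one token id, the tokenizer learned it as a unit; the glitch arises because the model's embedding for that id was never meaningfully updated, not because of anything about the counting numbers themselves.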