They'd better not have removed the good stuff, like the full texts of the subreddit dedicated to counting to a million, the logs of so many hashed numbers from various cryptos, and the tables of datamined stats from just about every console game.
> like the full texts of the subreddit dedicated to counting to a million
This was the source of the "anomalous tokens" phenomenon, where the usernames of prolific counters were yielding weird and unexpected behavior in the OpenAI models.
While definitely an interesting scientific curiosity, is there a reason you'd actually want this in a production model?
This is not entirely correct, from what I understand. The source of the "anomalous token" phenomenon is that those texts were included when training the tokenizer but excluded when training the models. It is not clear they would necessarily induce the same effect otherwise (i.e., if both the tokenizer and the LLMs were trained on those "counting" texts).
EDIT: notice that the "tokens" that trigger the "glitch" are not the numbers themselves but the usernames of the people counting on that subreddit (which appear nowhere in the training dataset, due to a cleaning step that removed the "counting" texts)
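For anyone curious, here's a rough sketch of how you could hunt for such tokens yourself: re-tokenize the cleaned corpus and flag vocabulary entries that never appear in it. This assumes a Hugging Face tokenizer and an iterable of documents; the names (`tokenizer_name`, `documents`) are placeholders, not anything from the actual dataset tooling.

```python
# Sketch: find tokenizer vocab entries that never occur in the cleaned
# training corpus -- the classic recipe for "glitch" / under-trained tokens.
from transformers import AutoTokenizer

def find_untrained_tokens(tokenizer_name: str, documents) -> list[str]:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    seen_ids = set()
    for doc in documents:
        # Collect every token id that actually shows up in the corpus.
        seen_ids.update(tokenizer.encode(doc, add_special_tokens=False))
    vocab = tokenizer.get_vocab()  # token string -> id
    # Tokens that exist in the vocab but were never seen during training
    # are candidates for anomalous behavior.
    return [tok for tok, idx in vocab.items() if idx not in seen_ids]

# e.g. glitch_candidates = find_untrained_tokens("gpt2", cleaned_corpus_docs)
```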
Do they mention anywhere the definition of "low quality" data or the proportion of removed data that was low quality versus duplicate?
They mention "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case is just explicit duplication of the training data. So the only potential gains would be from the removal of the low-quality data?
> After removing punctuation, space symbols, newlines and tabs, we filtered out documents with less than 200 characters. These documents typically contain only meta data and no useful information.
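That rule is simple enough to reproduce; here's a minimal sketch, assuming the 200-character threshold from the quote and a plain character-set normalization (the exact normalization they use may differ):

```python
import string

# Characters stripped before measuring document length, per the quoted rule:
# punctuation, spaces, newlines, and tabs.
_STRIP_CHARS = set(string.punctuation) | set(" \t\n\r")

def is_low_quality(doc: str, min_chars: int = 200) -> bool:
    """Return True if the document has fewer than min_chars informative characters."""
    stripped = "".join(ch for ch in doc if ch not in _STRIP_CHARS)
    return len(stripped) < min_chars

# Usage: kept_docs = [d for d in raw_docs if not is_low_quality(d)]
```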
> But i guess "upsampling" in this case is just explicit duplication of the training data.
Possibly, but duplication amounts to weighting, and that is important for unbalanced training sets and improves results in practice.
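In its simplest form, that weighting is just repeating each source according to a target mix before shuffling. A toy sketch (the weights below are invented for illustration, not the actual RedPajama/SlimPajama proportions):

```python
import random

def upsample(sources: dict[str, list[str]], weights: dict[str, int]) -> list[str]:
    """Upsample by explicit duplication: repeat each source's documents
    by an integer weight, then shuffle the combined pool."""
    mixed: list[str] = []
    for name, docs in sources.items():
        mixed.extend(docs * weights.get(name, 1))  # explicit duplication
    random.shuffle(mixed)
    return mixed

# e.g. mixed = upsample({"books": book_docs, "web": web_docs},
#                       {"books": 2, "web": 1})
```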
I'm interested in seeing this scaled up to larger models (30B+ parameters), and the dataset expanded with more high-quality data (scientific papers, more books, more code, etc.).