> “Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero.”
The thought just occurred to me: does training data ordering have any impact on LLM loss? It's not a terrible experiment to try, right?
The reason I'm asking is that we tend to educate humans along a fairly well-ordered curriculum. Take, for example, children's books. The limited vocabulary, the simple sentence structure, and the concreteness of the ideas are all supposed to help them learn better. In other words, parents aren't trying to teach their children language by reading them to sleep from The Pile[1]. Doing so would be considered detrimental to language learning. Like I said, just a thought.
Research [0] suggests that pretraining to uncover sparse subnetworks is faster when using "easy" subsets of the training data. It's a very small part of the overall picture, but I do expect to see more research on data hashing/subsetting/ordering for training optimization in the near future.
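To make that concrete, here's a toy sketch of what "easy subset" selection could look like. This is not how [0] defines easiness; I'm just using "short documents made of common words" as a stand-in proxy, with the scoring function and the 25% cutoff entirely made up for illustration:

```python
from collections import Counter

def easiness_score(tokens, freq, total):
    # Higher score = more common words and shorter text.
    # This proxy is my assumption, not the paper's definition of "easy".
    if not tokens:
        return 0.0
    avg_rel_freq = sum(freq[t] for t in tokens) / (len(tokens) * total)
    return avg_rel_freq / len(tokens)

def easy_subset(corpus, fraction=0.25):
    # corpus: list of token lists. Keep the top `fraction` by easiness.
    freq = Counter(t for doc in corpus for t in doc)
    total = sum(freq.values())
    ranked = sorted(
        corpus,
        key=lambda doc: easiness_score(doc, freq, total),
        reverse=True,
    )
    return ranked[: max(1, int(len(ranked) * fraction))]
```

You could then pretrain on the returned subset first and fold in the rest later; whether that ordering actually helps is exactly the open question.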
At the very least, I think transformers benefit from a progressive increase in sample length. But building a principled curriculum based on abstract, semantic-level properties of the content doesn't seem to work, or we don't know how to prioritize it.
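For the sample-length point, here's a minimal sketch of what I mean by a length curriculum: train on short sequences first and gradually raise the cap. It assumes a tokenized dataset where each example carries an "input_ids" list, and the 512-to-2048 stage schedule is just an illustrative guess:

```python
def length_curriculum(dataset, stages=(512, 1024, 1536, 2048)):
    """Yield (max_len, subset) pairs, one per curriculum stage."""
    for max_len in stages:
        # Keep only samples that fit under the current length cap.
        subset = [ex for ex in dataset if len(ex["input_ids"]) <= max_len]
        yield max_len, subset

# Usage: feed each stage's subset to your trainer before moving on.
# for max_len, subset in length_curriculum(tokenized_data):
#     train_for_one_stage(subset, max_seq_len=max_len)  # hypothetical trainer call
```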