Scaling Data-Constrained Language Models

williamtrask · on May 31, 2023

> “Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero.”

kelseyfrog · on May 31, 2023

The thought just occurred to me, does training data ordering have any impact on LLM loss? It's not a terrible experiment to try, right?

The reason I'm asking is because we tend to educate humans using a fairly well-ordered path of curriculum. Take, for example, children's books. The limited vocab, sentence complexity, and concreteness of ideas are all supposed to help them learn better. In other words, parents aren't trying to teach their children language by reading them to sleep from The Pile[1]. Doing so would be considered detrimental to language learning. Like I said, just a thought.

1. https://arxiv.org/abs/2101.00027

nmfisher · on May 31, 2023

Research [0] suggests that pretraining to uncover sparse subnetworks is faster when using "easy" subsets of the training data. Very small part of the overall picture but I do expect to see more research on data hashing/subsetting/ordering for training optimization in the near future.

[0] https://arxiv.org/pdf/2206.01278.pdf

airgapstopgap · on May 31, 2023

At the very least, I think transformers benefit from a progressive increase in sample length. But building a principled curriculum based on abstract semantic-level properties of the content doesn't seem to work, or we don't know how to prioritize it.
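For anyone who wants to try the length-based version, here is a minimal sketch of what such an ordering could look like. The function name `length_curriculum` and the `num_stages` parameter are made up for illustration, and length is measured with `len()` on whatever each sample is (characters for strings, tokens for token lists). It is not taken from any particular paper or training codebase.

```python
import random


def length_curriculum(samples, num_stages=3, seed=0):
    """Order samples so that shorter ones come first.

    Samples are sorted by length, split into `num_stages` contiguous
    buckets, and shuffled only within each bucket. Training then sees
    short samples early and long samples late, while keeping some
    randomness inside every stage.
    """
    rng = random.Random(seed)
    ordered = sorted(samples, key=len)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    result = []
    for stage in range(num_stages):
        bucket = ordered[stage * stage_size:(stage + 1) * stage_size]
        rng.shuffle(bucket)  # randomize order inside the stage only
        result.extend(bucket)
    return result


if __name__ == "__main__":
    docs = ["a" * n for n in [5, 1, 9, 3, 7, 2]]
    print([len(d) for d in length_curriculum(docs)])
```

Shuffling inside each stage is a deliberate compromise: the coarse order follows length, but consecutive batches are not strictly sorted.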

warkdarrior · on May 31, 2023

SGD fundamentally relies on randomly sampling the training data.
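To make that concrete, this is roughly the sampling scheme most minibatch training loops use in practice: reshuffle the indices every epoch and walk through them in batches. It is only a sketch; `shuffled_batches` and its parameters are hypothetical names, not any framework's API.

```python
import random


def shuffled_batches(num_samples, batch_size, num_epochs, seed=0):
    """Yield (epoch, batch_indices) with a fresh shuffle each epoch.

    Every index appears exactly once per epoch, but the order (and so
    the composition of each minibatch) changes from epoch to epoch.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, num_samples, batch_size):
            # Copy the slice so later shuffles cannot alter yielded batches.
            yield epoch, list(indices[start:start + batch_size])


if __name__ == "__main__":
    for epoch, batch in shuffled_batches(10, batch_size=4, num_epochs=2):
        print(epoch, batch)
```

A curriculum replaces the shuffle over the whole dataset with some structured order, which is exactly the assumption being questioned upthread.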