IsoFLOP curves of large language models are flat (severelytheoretical.wordpress.com)
53 points by alexmolas on Aug 2, 2024 | 7 comments


What they missed is that current scaling laws (OpenAI's, DeepMind's Chinchilla) assume the model is trained for a single epoch. This essentially means that in order to scale compute, you have to scale the model size and/or the dataset size. So Meta cannot simply spend 3.8e25 FLOPs on a 70B model: to do that, they would need roughly 86T pretraining tokens, which they do not have.
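
A back-of-the-envelope version of that calculation, using the common approximation C ≈ 6·N·D for training FLOPs (the exact figure shifts a bit depending on how you count attention FLOPs), looks roughly like this:

    # Rough token requirement under the C ~= 6 * N * D approximation.
    # Constants are illustrative, not Meta's actual numbers.
    def tokens_needed(compute_flops, n_params):
        return compute_flops / (6 * n_params)

    C = 3.8e25   # the compute budget quoted above
    N = 70e9     # a 70B-parameter model
    print(f"{tokens_needed(C, N):.2e} tokens")  # ~9e13, i.e. on the order of 90T tokens

That lands in the same ballpark as the ~86T figure; the gap is just the fudge factor in the FLOPs-per-token estimate.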

Of course, ultimately we will figure out scaling laws for LLMs trained on multiple epochs of data, but not today.


There is some good published research about doing multiple passes over the training data, and how quickly learning saturates. The TL;DR is that diminishing returns kick in after about 4 epochs.

https://arxiv.org/abs/2305.16264
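
For intuition on what "diminishing returns after ~4 epochs" looks like, the paper models repeated data as contributing a decaying amount of effective unique data. A minimal sketch of that idea (the decay constant below is made up for illustration, not the paper's fitted parameter):

    import math

    # Effective data multiplier when a fixed set of unique tokens is repeated.
    # The exponential-decay form follows the idea in arXiv:2305.16264;
    # r_star here is an illustrative constant, not the paper's fitted value.
    def effective_data_multiplier(epochs, r_star=5.0):
        repeats = epochs - 1
        return 1 + r_star * (1 - math.exp(-repeats / r_star))

    for e in (1, 2, 4, 8, 16):
        print(f"{e} epochs -> {effective_data_multiplier(e):.2f}x effective data")

Each additional epoch buys less than the one before; past a handful of epochs the multiplier flattens out.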


Yep, I have seen this paper before; thank you for linking it here for reference. My personal opinion is that, compared to single-epoch scaling laws, we still need more evidence and literature on the effects of multiple epochs, but this paper is one of the best results we have so far on multi-epoch training.


But inside one epoch there is already a lot of duplication.

By duplication I mean that if the context length is N, there are many sequences of N words that are not unique.
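
A toy illustration, treating every length-N word window in the corpus as a training sequence (real pipelines chunk documents differently, so this overstates the overlap):

    # Sliding windows of length N over a token stream: consecutive windows
    # share N-1 tokens, and repeated phrases in natural text produce
    # duplicate windows even within a single pass over the data.
    tokens = "the cat sat on the mat and the dog sat on the mat".split()
    N = 4
    windows = [tuple(tokens[i:i+N]) for i in range(len(tokens) - N + 1)]
    print(len(windows), "windows,", len(set(windows)), "unique")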


> So, these models are basically within 1% of each other in terms of final pretraining loss.

How is this loss calculated, though? Since it is called "loss" and not "performance metric", I'm going to assume it is the teacher-forced cross-entropy loss.

I'm not too familiar with LLM training, but having done a fair amount of seq2seq training in other domains lately, I've observed that the relationship between "loss" and autoregressive inference performance becomes very steep towards the end of training: smaller and smaller reductions in loss yield disproportionately large improvements in the autoregressive output. So I suspect that in practice the 1% loss difference actually is incredibly significant with respect to how well the model performs at inference time.
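
For reference, by "teacher-forced cross-entropy" I mean the standard next-token loss where the model is conditioned on the ground-truth prefix at every position rather than on its own generations; a minimal PyTorch-style sketch (the `model` call and tensor shapes are assumptions, not any particular codebase's API):

    import torch
    import torch.nn.functional as F

    # Teacher-forced next-token cross-entropy: the model always sees the true
    # prefix, so errors never compound the way they do during autoregressive
    # generation. `model` is assumed to map (batch, seq) token ids to
    # (batch, seq, vocab) logits.
    def teacher_forced_loss(model, token_ids):
        logits = model(token_ids[:, :-1])   # predict token t+1 from tokens <= t
        targets = token_ids[:, 1:]
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )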

But I'm pretty interested in this topic, so if people here have observed this (or the contrary) I'd be curious to hear about it.


Page 9 of the Llama 3 tech report has an interesting graph that predicts task-level performance from the cross-entropy loss. The sigmoidal model fits well, and at the steepest part of the S, a 0.01 change in NLL is worth about 5% task-level accuracy.

Here is a quick screenshot if you are lazy.

https://snipboard.io/C6mipQ.jpg

And here is the paper if you want to dig into some of the highest-quality published research on frontier LLMs.

https://ai.meta.com/research/publications/the-llama-3-herd-o...
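
The shape of that relationship can be eyeballed with a generic sigmoid fit; the parameters below are made up to roughly match "a 0.01 NLL change is worth ~5% accuracy at the steepest point", they are not the paper's fitted values:

    import math

    # Generic sigmoid mapping from NLL to task accuracy; floor, ceiling,
    # midpoint, and steepness are illustrative only.
    def task_accuracy(nll, floor=0.25, ceiling=0.95, mid=0.6, steepness=30.0):
        return floor + (ceiling - floor) / (1 + math.exp(steepness * (nll - mid)))

    # Near the steepest point, a 0.01 drop in NLL moves accuracy by ~5 points.
    print(task_accuracy(0.60) - task_accuracy(0.61))  # ~0.05

Away from the steep region the same 0.01 change in NLL barely moves accuracy, which is presumably why small loss gaps can look either negligible or decisive depending on where the model sits on the curve.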


Thanks, that's actually pretty much what I was thinking. I'll have to read it to understand the significance of that S curve; pretty interesting. Appreciate the link!



