The emerging consensus for larger LLMs is that you want to train them on at least 2-4x as many tokens as parameters (the weights between neurons in the layers). A trillion tokens (100x) surprises me.
The LLaMA paper contradicts this view:
"[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens."
https://arxiv.org/pdf/2302.13971.pdf
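For concreteness, here is a quick back-of-envelope sketch of the tokens-per-parameter ratios implied by the figures above. The model sizes and token counts are just the rounded values cited in this thread, nothing more precise:

```python
# Back-of-envelope tokens-per-parameter ratios for the figures mentioned above.
# Sizes and token counts are the rounded values cited, not exact paper numbers.
configs = {
    "Hoffmann et al. recommendation (10B params, 200B tokens)": (10e9, 200e9),
    "LLaMA 7B as trained (7B params, 1T tokens)": (7e9, 1e12),
    "1T tokens on a 10B model": (10e9, 1e12),
}

for name, (params, tokens) in configs.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
```

This prints roughly 20x, 143x, and 100x, which is where the gap between the Hoffmann et al. recommendation and the LLaMA 7B training run comes from.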
They probably put most of the effort into the 65B model; the 7B model was likely trained just so they could get an idea of the scaling behaviour. It makes sense to use the same number of training steps for all sizes, then.
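A minimal sketch of that point, assuming a shared training schedule and purely hypothetical round numbers: with the same step count and batch configuration, the token budget is fixed by the schedule, not by the model size, so the smaller models inevitably end up well past the 20x-tokens-per-parameter mark.

```python
# Sketch: with a shared training schedule, the token budget is determined by
# steps * tokens_per_step, independent of parameter count.
# The figures below are hypothetical round numbers, not LLaMA's actual settings.
steps = 250_000
tokens_per_step = 4_000_000  # batch size (sequences) * sequence length

total_tokens = steps * tokens_per_step
print(f"~{total_tokens / 1e12:.0f}T tokens for every model size trained on this schedule")
```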