The emerging consensus for larger LLMs is that you want to train them on at least 2-4x as many tokens as parameters (the weights between neurons in the layers). A trillion tokens (100x) surprises me.
The LLaMA paper contradicts this view:
"[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens."
https://arxiv.org/pdf/2302.13971.pdf
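For concreteness, here is a quick back-of-envelope sketch of the tokens-per-parameter ratios implied by the figures above. The model sizes and token counts are just the rounded values cited in this thread, nothing more precise:

```python
# Back-of-envelope tokens-per-parameter ratios for the figures mentioned above.
# Sizes and token counts are the rounded values cited, not exact paper numbers.
configs = {
    "Hoffmann et al. recommendation (10B params, 200B tokens)": (10e9, 200e9),
    "LLaMA 7B as trained (7B params, 1T tokens)": (7e9, 1e12),
    "1T tokens on a 10B model": (10e9, 1e12),
}

for name, (params, tokens) in configs.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
```

This prints roughly 20x, 143x, and 100x, which is where the gap between the Hoffmann et al. recommendation and the LLaMA 7B training run comes from.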
They probably put most of the effort into the 65B model; the 7B model was likely trained just so they could get an idea of the scaling behaviour. It makes sense to use the same number of training steps for all sizes, then.
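A minimal sketch of that point, assuming a shared training schedule and purely hypothetical round numbers: with the same step count and batch configuration, the token budget is fixed by the schedule, not by the model size, so the smaller models inevitably end up well past the 20x-tokens-per-parameter mark.

```python
# Sketch: with a shared training schedule, the token budget is determined by
# steps * tokens_per_step, independent of parameter count.
# The figures below are hypothetical round numbers, not LLaMA's actual settings.
steps = 250_000
tokens_per_step = 4_000_000  # batch size (sequences) * sequence length

total_tokens = steps * tokens_per_step
print(f"~{total_tokens / 1e12:.0f}T tokens for every model size trained on this schedule")
```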