The authors reported perplexity only for small models, up to 3B parameters. On the other hand, they reported throughput for the 70B model, but not its performance (perplexity, end-to-end tasks). That's a very unfortunate omission. Overall, the paper is rather poorly written.
If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to keep the comparisons apples-to-apples. That's sensible, and it makes the results easier to replicate. Otherwise, I agree with you that more extensive testing, after more extensive pretraining and at larger model sizes, is still necessary.