The authors reported perplexity only for the smaller models, up to 3B parameters. On the other hand, they reported throughput for the 70B model, but not its quality (perplexity or end-to-end task performance). That's a very unfortunate omission. Overall, the paper is rather poorly written.
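For context, the perplexity in question is just the exponentiated average per-token negative log-likelihood on held-out text. A minimal sketch of how it is typically computed, assuming a Hugging Face causal LM checkpoint ("gpt2" is only a placeholder here, not the paper's actual evaluation code):

    # Perplexity = exp(mean negative log-likelihood per token).
    # Assumed setup: any causal LM checkpoint works; "gpt2" is a placeholder.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    text = "The quick brown fox jumps over the lazy dog."
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy over predicted tokens (shifted internally).
        loss = model(ids, labels=ids).loss

    print("perplexity:", math.exp(loss.item()))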



If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to keep the comparisons apples-to-apples. That's sensible, and it makes the results easier to replicate. That said, I agree with you that more extensive testing, after more extensive pretraining and at larger model sizes, is still needed.


Towards the end of the paper, they mention training on 2T tokens.


You're right. Thank you for pointing that out.



