
Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and the nanoGPT author is trying to reproduce that here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
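For context, the core of what that notebook tries to reproduce is Chinchilla's parametric loss fit and the compute-optimal allocation it implies. A rough sketch in Python, using the Approach-3 constants as reported in Hoffmann et al.; treat the exact numbers as illustrative, since reproducing them precisely is the open question:

    import numpy as np

    # Chinchilla's "Approach 3" fits a parametric loss of the form
    #   L(N, D) = E + A / N**alpha + B / D**beta
    # The constants below are the fitted values reported in Hoffmann et al.;
    # take them as illustrative, since reproducing them exactly is the issue here.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        """Predicted pretraining loss for N parameters trained on D tokens."""
        return E + A / N**alpha + B / D**beta

    def compute_optimal(C):
        """Minimise loss(N, D) subject to C = 6*N*D, by brute-force grid search."""
        N = np.logspace(8, 13, 2000)   # candidate model sizes
        D = C / (6 * N)                # tokens implied by the compute budget
        i = np.argmin(loss(N, D))
        return N[i], D[i]

    # Because alpha and beta are close, the optimum scales N and D roughly in
    # tandem: ~4x more compute buys ~2x more parameters and ~2x more tokens.
    for C in (1e21, 4e21, 1.6e22):
        N_opt, D_opt = compute_optimal(C)
        print(f"C={C:.1e}: N~{N_opt:.2e} params, D~{D_opt:.2e} tokens")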




I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation in the paper is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on the broad themes of the paper (e.g. scale model size and data approximately in tandem) and just roughly interpolating the main figure, without sweating the details. The corresponding authors responded very kindly at first, and I was able to bring the results closer, but they have since gone dark. I'm still hoping to make things match; if others in the LLM space can spot any issues in my own reproduction, please let me know.
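For anyone following along: the back-of-envelope cost model usually used as a sanity check against the paper's (more detailed) FLOPs expression is roughly 6 FLOPs per parameter per token. A minimal sketch, with made-up hyperparameters rather than anything from the paper:

    # Back-of-envelope numbers for a GPT-style model; the hyperparameters below
    # are made up for illustration, not taken from the Chinchilla paper.

    def gpt_params(n_layer, d_model, vocab_size):
        # ~12*d_model^2 weights per block (attention qkv + proj, plus the 4x MLP),
        # plus the token embedding matrix; layernorms and biases are ignored.
        return 12 * n_layer * d_model**2 + vocab_size * d_model

    def train_flops(n_params, n_tokens):
        # The usual approximation: ~6 FLOPs per parameter per token
        # (~2 for the forward pass, ~4 for the backward pass).
        return 6 * n_params * n_tokens

    N = gpt_params(n_layer=48, d_model=6144, vocab_size=50304)   # ~22B params
    print(f"{N:.2e} params, {train_flops(N, 300e9):.2e} FLOPs for 300B tokens")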


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they 10x'ed the number of training tokens also.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
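Rough arithmetic on that, using the C ≈ 6·N·D approximation and the ~20-tokens-per-parameter rule of thumb people usually quote from Chinchilla (the GPT-3 figures, 175B parameters on ~300B tokens, are from the GPT-3 paper):

    # GPT-3 figures from Brown et al.: 175B parameters, ~300B training tokens.
    N_gpt3 = 175e9
    D_gpt3 = 300e9
    C = 6 * N_gpt3 * D_gpt3          # ~3.15e23 training FLOPs

    # Compute-optimal split at the same budget, assuming D = 20 * N
    # (the rough tokens-per-parameter ratio usually quoted from Chinchilla):
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    N_opt = (C / 120) ** 0.5
    D_opt = 20 * N_opt
    print(f"GPT-3 budget: {C:.2e} FLOPs")
    print(f"compute-optimal at that budget: ~{N_opt/1e9:.0f}B params on ~{D_opt/1e12:.1f}T tokens")
    # -> roughly a 50B-parameter model on ~1T tokens, so GPT-3 (175B params,
    #    300B tokens) is indeed undertrained by this rule of thumb.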



