
Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and the nanoGPT author is trying to reproduce that here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
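For context, the core of what that notebook tries to reproduce is Chinchilla's parametric loss fit and the compute-optimal allocation it implies. A rough sketch in Python, using the Approach-3 constants as reported in Hoffmann et al.; treat the exact numbers as illustrative, since reproducing them precisely is the open question:

    import numpy as np

    # Chinchilla's "Approach 3" fits a parametric loss of the form
    #   L(N, D) = E + A / N**alpha + B / D**beta
    # The constants below are the fitted values reported in Hoffmann et al.;
    # take them as illustrative, since reproducing them exactly is the issue here.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        """Predicted pretraining loss for N parameters trained on D tokens."""
        return E + A / N**alpha + B / D**beta

    def compute_optimal(C):
        """Minimise loss(N, D) subject to C = 6*N*D, by brute-force grid search."""
        N = np.logspace(8, 13, 2000)   # candidate model sizes
        D = C / (6 * N)                # tokens implied by the compute budget
        i = np.argmin(loss(N, D))
        return N[i], D[i]

    # Because alpha and beta are close, the optimum scales N and D roughly in
    # tandem: ~4x more compute buys ~2x more parameters and ~2x more tokens.
    for C in (1e21, 4e21, 1.6e22):
        N_opt, D_opt = compute_optimal(C)
        print(f"C={C:.1e}: N~{N_opt:.2e} params, D~{D_opt:.2e} tokens")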




I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation in the paper is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on the broad themes of the paper (e.g. scale model size and data approximately in tandem) and just roughly interpolating the main figure, without sweating the details. The corresponding authors responded very kindly at first, and I was able to bring the results closer, but they have since gone dark. I'm still hoping to make things match; if others in the LLM space can spot any issues in my own reproduction, please let me know.
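For anyone following along: the back-of-envelope cost model usually used as a sanity check against the paper's (more detailed) FLOPs expression is roughly 6 FLOPs per parameter per token. A minimal sketch, with made-up hyperparameters rather than anything from the paper:

    # Back-of-envelope numbers for a GPT-style model; the hyperparameters below
    # are made up for illustration, not taken from the Chinchilla paper.

    def gpt_params(n_layer, d_model, vocab_size):
        # ~12*d_model^2 weights per block (attention qkv + proj, plus the 4x MLP),
        # plus the token embedding matrix; layernorms and biases are ignored.
        return 12 * n_layer * d_model**2 + vocab_size * d_model

    def train_flops(n_params, n_tokens):
        # The usual approximation: ~6 FLOPs per parameter per token
        # (~2 for the forward pass, ~4 for the backward pass).
        return 6 * n_params * n_tokens

    N = gpt_params(n_layer=48, d_model=6144, vocab_size=50304)   # ~22B params
    print(f"{N:.2e} params, {train_flops(N, 300e9):.2e} FLOPs for 300B tokens")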


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they 10x'ed the number of training tokens also.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
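Rough arithmetic on that, using the C ≈ 6·N·D approximation and the ~20-tokens-per-parameter rule of thumb people usually quote from Chinchilla (the GPT-3 figures, 175B parameters on ~300B tokens, are from the GPT-3 paper):

    # GPT-3 figures from Brown et al.: 175B parameters, ~300B training tokens.
    N_gpt3 = 175e9
    D_gpt3 = 300e9
    C = 6 * N_gpt3 * D_gpt3          # ~3.15e23 training FLOPs

    # Compute-optimal split at the same budget, assuming D = 20 * N
    # (the rough tokens-per-parameter ratio usually quoted from Chinchilla):
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    N_opt = (C / 120) ** 0.5
    D_opt = 20 * N_opt
    print(f"GPT-3 budget: {C:.2e} FLOPs")
    print(f"compute-optimal at that budget: ~{N_opt/1e9:.0f}B params on ~{D_opt/1e12:.1f}T tokens")
    # -> roughly a 50B-parameter model on ~1T tokens, so GPT-3 (175B params,
    #    300B tokens) is indeed undertrained by this rule of thumb.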



