> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is meant in the context of OP's "limited compute" constraint; the statement I'm trying to make is "fully trained small models are just as good as an undertrained large model".
> …but surely, the laws of diminishing returns applies here?
They do, but it's diminishing in the sense that the performance gains of larger models become smaller and smaller, while the required training time grows a lot. If I'm reading the first chart of figure 2 (page 5) correctly, comparing 5B vs 10B, the 10B model needs almost 10x the training time for a ~10% loss gain, and it's a similar jump from 1B to 5B. My understanding is that this keeps flattening out too, with the loss gain from each 10x of compute becoming gradually lower and lower.
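To make that concrete, here's a minimal sketch using the parametric scaling law from the Chinchilla paper (Hoffmann et al. 2022), L(N, D) = E + A/N^α + B/D^β with training compute C ≈ 6·N·D. The coefficients are that paper's fitted values, used here as assumptions, so the numbers won't exactly match the figure above. It shows both the fixed-budget tradeoff (a fully trained 5B edging out an undertrained 10B) and the shrinking loss gain per 10x of compute:

```python
# Chinchilla-style parametric loss (Hoffmann et al. 2022, "approach 3" fit):
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameters, D = training tokens, training compute C ~= 6 * N * D FLOPs.
# Coefficients are the paper's fitted values, used here as assumptions.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Fixed budget: what a 5B model fully trained on 100B tokens costs.
C = 6 * 5e9 * 100e9

# Spend it two ways: small-and-fully-trained vs big-and-undertrained.
for n in (5e9, 10e9):
    d = C / (6 * n)  # tokens affordable at this model size
    print(f"{n/1e9:.0f}B params, {d/1e9:.0f}B tokens -> loss {loss(n, d):.3f}")

# Diminishing returns: the loss gain from each 10x of compute shrinks.
prev = None
for c in (1e20, 1e21, 1e22, 1e23):
    n = d = (c / 6) ** 0.5  # crude compute-optimal split: N and D grow together
    cur = loss(n, d)
    note = f"  (gain {prev - cur:.3f})" if prev is not None else ""
    print(f"C = {c:.0e} FLOPs -> loss {cur:.3f}{note}")
    prev = cur
```

At the same budget the 5B/100B-token run lands at a lower loss than the 10B/50B-token run, and each 10x of compute buys a smaller loss improvement than the last.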
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1000 GPUs for 30 days for a small model, you need 1000 GPUs for 300 days for that ~10% loss gain at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment; I don't think they can scale it up from what I believe is already a ~1-2T parameter model.
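The budget arithmetic is easy to sanity-check; a quick sketch (the per-GPU throughput is a made-up assumption, only the ratios matter):

```python
# Back-of-envelope budget math; the throughput figure is an illustrative
# assumption, just to show that a 10x compute step means 10x time or 10x GPUs.
SECONDS_PER_DAY = 86_400
FLOPS_PER_GPU = 150e12  # assumed sustained FLOP/s per GPU (not a real spec)

def days_needed(total_flops: float, n_gpus: int) -> float:
    """Wall-clock days to spend `total_flops` across `n_gpus` GPUs."""
    return total_flops / (n_gpus * FLOPS_PER_GPU * SECONDS_PER_DAY)

# "Small model" budget: whatever 1000 GPUs get through in 30 days.
budget = 1000 * 30 * SECONDS_PER_DAY * FLOPS_PER_GPU

print(f"1x budget, 1000 GPUs:   {days_needed(budget, 1000):.0f} days")
print(f"10x budget, 1000 GPUs:  {days_needed(10 * budget, 1000):.0f} days")
print(f"10x budget, 10000 GPUs: {days_needed(10 * budget, 10_000):.0f} days")
```

Either axis of that tradeoff, 10x the calendar time or 10x the hardware, runs into a hard wall long before the loss curve does.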