If we expect GPT-5 to demand 100x the training compute of GPT-4, and GPT-4 was trained in months on 10k H100s, then you would need years with 100k H100s, or perhaps months again with 100k GB200s.
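The scaling above can be sketched as simple arithmetic: training time scales with total compute and inversely with GPU count and per-GPU speed. All the numbers here (3 months for GPT-4, a ~10x effective GB200 speedup) are illustrative assumptions matching the comment, not measured figures.

```python
# Back-of-envelope: training time ~ compute / (gpu_count * per_gpu_speed).
# Baseline figures are assumptions, not measured data.
GPT4_TIME_MONTHS = 3        # assumed: GPT-4 trained in ~3 months
GPT4_GPUS = 10_000          # assumed: on 10k H100s

def train_time_months(compute_multiplier, gpus, per_gpu_speedup=1.0):
    """Scale the assumed GPT-4 baseline to a new workload and cluster."""
    return GPT4_TIME_MONTHS * compute_multiplier / (gpus / GPT4_GPUS) / per_gpu_speedup

# 100x compute on 100k H100s: 10x more GPUs, same per-GPU speed -> ~10x longer.
print(train_time_months(100, 100_000))        # -> 30.0 months, i.e. years
# 100k GB200s with an assumed ~10x effective per-GPU speedup -> months again.
print(train_time_months(100, 100_000, 10.0))  # -> 3.0 months
```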
See, there is your answer. The issue is that GPU compute is still way too low for GPT-5 if they keep scaling parameters the way they used to.
GPT-3 took months on 10k A100s; 10k H100s would have done it in a fraction of the time. Blackwell could train GPT-4 in about 10 days with the same number of GPUs that took Hopper months.
Don't forget GPT-3 is only about 2.5 years old. Training is obviously waiting for the next step up in training speed from large clusters. Don't be fooled: the 2x Blackwell-vs-Hopper figure is only chip vs. chip. 10k Blackwell GPUs, including all the networking speedups, are easily 10x or more faster than the same number of Hopper GPUs. So building a 1-million-GPU Blackwell cluster means 100x more training compute than a 100k Hopper cluster.
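The 100x claim decomposes into two factors: 10x more GPUs and an assumed ~10x effective per-GPU gain once networking is included. A minimal sketch, with the 10x per-GPU figure taken as the comment's assumption rather than a benchmark:

```python
# Illustrative cluster-level compute comparison. The ~10x per-GPU figure
# (2x chip-vs-chip plus interconnect/networking gains) is an assumption.
HOPPER_PER_GPU = 1.0        # normalized effective training throughput
BLACKWELL_PER_GPU = 10.0    # assumed: ~10x Hopper at cluster scale

hopper_cluster = 100_000 * HOPPER_PER_GPU         # 100k Hopper baseline
blackwell_cluster = 1_000_000 * BLACKWELL_PER_GPU  # 1M Blackwell cluster

print(blackwell_cluster / hopper_cluster)  # -> 100.0 (10x GPUs x 10x per GPU)
```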
Nobody starts a model training run if it takes years to finish... too much risk in that.
The Transformer model was introduced in 2017, and ChatGPT came out in 2022. Why the gap? Because training it would have needed millions of Volta GPUs instead of thousands of Ampere GPUs.