Training a "frontier model" without first validating the architecture at smaller scale is very risky.
Meta trained the smaller Llama 3 models first, and then trained the 405B model on the same architecture once it had been validated on the smaller ones. Later, they went back and used that 405B model to improve the smaller models for the Llama 3.1 release. Mistral started with a number of small models before scaling up to larger models.
I feel like this is a fairly common pattern.
If Google had a bigger version of Gemini 2.0 ready to go, I feel confident they would have mentioned it — and they couldn't have distilled it down to a small model unless it was ready to go.