It really feels like the reason this is being released now rather than months ago is that this is how long it took them to figure out the convoluted combination of evaluation procedures needed to beat GPT-4 on the various benchmarks.
"Dearest LLM: Given the following raw benchmark metrics, please compose an HTML table that cherry-picks and highlights the most favorable result in each major benchmark category"
Even without a moat anymore, with their cash they might still be the biggest search provider 10 years from now. IBM still exists and is worth $146B. I wouldn't be surprised if Google still came out OK.
Assuming they use unique data that only they have in order to make a better LLM, everyone else is going to leech training examples from it, bringing the competition asymptotically closer, though never quite reaching parity. It's hard to copy-protect a model exposed to the public, as OpenAI is finding out.
Many, many tasks can be executed on local GPUs today without paying a dime to OpenAI; there is no moat. AI likes to learn from other AIs. Give me a million hard problems solved step by step with GPT-5 and I can make Mistral much smarter. Everyone knows this dataset is going to leak within a few months.
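To make the "AI likes to learn from other AIs" point concrete, here is a minimal sketch of that distillation loop: ask a stronger teacher model for step-by-step solutions and save them as a supervised fine-tuning set for a smaller open model. The teacher name, record format, and file path are placeholders of mine, not anything from the post (GPT-5 has no public API, so a current model stands in):

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def solve_step_by_step(problem: str, teacher: str = "gpt-4-turbo") -> str:
        """Ask the teacher model for a worked, step-by-step solution."""
        resp = client.chat.completions.create(
            model=teacher,
            temperature=0.2,
            messages=[
                {"role": "system",
                 "content": "Solve the problem step by step, then state the final answer."},
                {"role": "user", "content": problem},
            ],
        )
        return resp.choices[0].message.content

    def build_sft_dataset(problems: list[str], out_path: str = "distill.jsonl") -> None:
        """Write (instruction, response) pairs in a format most SFT trainers accept."""
        with open(out_path, "w") as f:
            for p in problems:
                record = {"instruction": p, "response": solve_step_by_step(p)}
                f.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        build_sft_dataset(["A train leaves at 3 pm travelling at 60 km/h ..."])

The fine-tuning run on Mistral is the expensive part, but the harvesting step above is all it takes to build the dataset everyone expects to leak.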
Why is that misleading? It shows that Gemini with CoT@32 is the best known combination of prompting strategy and LLM on MMLU.
They simply compare the prompting strategy that works best with each model. Otherwise it would just be a comparison of how each model responds to one specific piece of prompt engineering.
Where they do use the same methodology, the differences seem to fall within the error bars of the cherry-picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT-4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT-4 in most areas and worked backwards to figure out how to make that happen.
For MMLU, for example, it highlights the CoT@32 result, where Ultra beats GPT-4, even though Ultra loses to GPT-4 when both are measured 5-shot.
For GSM8K, it uses Maj1@32 for Ultra but 5-shot CoT for GPT-4, and so on (see the sketch below for what these sampling strategies mean).
On top of that, for some reason, it reports different metrics for Ultra and Pro, making the two hard to compare.
What a mess of a "paper".
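For anyone not fluent in the jargon: CoT@32 / Maj1@32 means sampling 32 chain-of-thought completions per question and majority-voting over the extracted answers, while a 5-shot score is a single greedy answer after five in-context examples. A rough sketch of the difference, with stand-in generate/extract functions rather than any real API:

    from collections import Counter
    from typing import Callable

    # `generate(prompt, temperature) -> str` is a stand-in for whatever model is scored.
    # `extract(completion) -> str` pulls the final answer out of a completion.

    def score_5_shot(generate: Callable[[str, float], str],
                     extract: Callable[[str], str],
                     prompt_with_5_examples: str) -> str:
        # Single greedy completion: one sample, one answer.
        return extract(generate(prompt_with_5_examples, 0.0))

    def score_maj_at_k(generate: Callable[[str, float], str],
                       extract: Callable[[str], str],
                       cot_prompt: str, k: int = 32,
                       temperature: float = 0.7) -> str:
        # Sample k chains of thought and majority-vote over the extracted answers
        # (self-consistency). More test-time compute, usually a higher score.
        answers = [extract(generate(cot_prompt, temperature)) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]

Comparing a Maj1@32 number for one model against a 5-shot number for another is comparing different amounts of test-time compute, which is the crux of the complaint above.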