It really feels like the reason this is being released now rather than months ago is that this is how long it took them to figure out the convoluted combination of evaluation procedures needed to beat GPT-4 on the various benchmarks.
"Dearest LLM: Given the following raw benchmark metrics, please compose an HTML table that cherry-picks and highlights the most favorable result in each major benchmark category"
Even without a moat anymore, with their cash they might still be the biggest search provider 10 years from now. IBM still exists and is worth $146B. I wouldn't be surprised if Google still came out OK.
Assuming they use unique data that only they have in order to make a better LLM, everyone else is going to leech training examples from it, bringing the competition asymptotically closer, though never quite reaching parity. It's hard to copy-protect a model exposed to the public, as OpenAI is finding out.
Many, many tasks can be executed on local GPUs today without paying a dime to OpenAI; there is no moat. AI likes to learn from other AIs. Give me a million hard problems solved step by step with GPT-5 and I can make Mistral much smarter. Everyone knows this dataset is going to leak within a few months.
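To make the "AI likes to learn from other AIs" point concrete, here is a minimal sketch of that distillation loop: ask a stronger teacher model for step-by-step solutions and save them as a supervised fine-tuning set for a smaller open model. The teacher name, record format, and file path are placeholders of mine, not anything from the post (GPT-5 has no public API, so a current model stands in):

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def solve_step_by_step(problem: str, teacher: str = "gpt-4-turbo") -> str:
        """Ask the teacher model for a worked, step-by-step solution."""
        resp = client.chat.completions.create(
            model=teacher,
            temperature=0.2,
            messages=[
                {"role": "system",
                 "content": "Solve the problem step by step, then state the final answer."},
                {"role": "user", "content": problem},
            ],
        )
        return resp.choices[0].message.content

    def build_sft_dataset(problems: list[str], out_path: str = "distill.jsonl") -> None:
        """Write (instruction, response) pairs in a format most SFT trainers accept."""
        with open(out_path, "w") as f:
            for p in problems:
                record = {"instruction": p, "response": solve_step_by_step(p)}
                f.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        build_sft_dataset(["A train leaves at 3 pm travelling at 60 km/h ..."])

The fine-tuning run on Mistral is the expensive part, but the harvesting step above is all it takes to build the dataset everyone expects to leak.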
Why is that misleading? It shows that Gemini with CoT@32 is the best known combination of prompting strategy and LLM on MMLU.
They simply compare the prompting strategy that works best with each model. Otherwise it would just be a comparison of how each model responds to one specific piece of prompt engineering.
Where they do use the same methodology, the differences seem to fall within the error bars of the cherry-picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT-4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT-4 in most areas and worked backwards to figure out how to make that happen.
For MMLU, for example, it highlights the CoT@32 result, where Ultra beats GPT-4, even though Ultra loses to GPT-4 when both are measured 5-shot.
For GSM8K, it uses Maj1@32 for Ultra but 5-shot CoT for GPT-4, and so on (see the sketch below for what these sampling strategies mean).
On top of that, for some reason, it reports different metrics for Ultra and Pro, making the two hard to compare.
What a mess of a "paper".
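For anyone not fluent in the jargon: CoT@32 / Maj1@32 means sampling 32 chain-of-thought completions per question and majority-voting over the extracted answers, while a 5-shot score is a single greedy answer after five in-context examples. A rough sketch of the difference, with stand-in generate/extract functions rather than any real API:

    from collections import Counter
    from typing import Callable

    # `generate(prompt, temperature) -> str` is a stand-in for whatever model is scored.
    # `extract(completion) -> str` pulls the final answer out of a completion.

    def score_5_shot(generate: Callable[[str, float], str],
                     extract: Callable[[str], str],
                     prompt_with_5_examples: str) -> str:
        # Single greedy completion: one sample, one answer.
        return extract(generate(prompt_with_5_examples, 0.0))

    def score_maj_at_k(generate: Callable[[str, float], str],
                       extract: Callable[[str], str],
                       cot_prompt: str, k: int = 32,
                       temperature: float = 0.7) -> str:
        # Sample k chains of thought and majority-vote over the extracted answers
        # (self-consistency). More test-time compute, usually a higher score.
        answers = [extract(generate(cot_prompt, temperature)) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]

Comparing a Maj1@32 number for one model against a 5-shot number for another is comparing different amounts of test-time compute, which is the crux of the complaint above.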