Evaluating 55 LLMs with GPT-4 (llmonitor.com)
36 points by vincelt 8 months ago | 8 comments



How is this benchmark not inherently biased towards GPT?

If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
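One way to probe that would be to grade the same set of answers with two different judges and check how well their rankings agree. A minimal sketch of the idea, with placeholder judge callables standing in for real GPT-4 and Claude API calls (the scoring lambdas here are dummies, not any actual rubric):

```python
# Cross-judge bias check: score the same answers with two judges and
# compare the rankings they produce. Strong disagreement suggests the
# choice of judge is driving the leaderboard.

def cross_judge(answers, judges):
    """Return {judge_name: [score, ...]} for the same list of answers."""
    return {name: [judge(a) for a in answers] for name, judge in judges.items()}

def spearman(xs, ys):
    """Spearman rank correlation, no external dependencies (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Dummy judges for illustration; in practice these would call the model APIs.
judges = {
    "gpt-4": lambda a: len(a) % 10,
    "claude": lambda a: (len(a) + 3) % 10,
}
answers = ["answer one", "a longer second answer", "third"]
scores = cross_judge(answers, judges)
print("rank agreement:", spearman(scores["gpt-4"], scores["claude"]))
```

A correlation near 1 would suggest the results are robust to the choice of judge; a low one would support the bias concern.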


I always find evals of this flavor off-putting, given that 3.5 and 4 likely share preference models (or at least feedback data).


They should be evaluating each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be done several times.
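Something like the sketch below, assuming a hypothetical grade() call that queries the judge model (real judges sample at nonzero temperature, so scores drift between runs; random jitter stands in for that here):

```python
# Repeated grading to estimate score variance: run the judge several
# times per answer and report mean +/- stdev instead of a point score.
from statistics import mean, stdev
import random

def grade(answer: str) -> float:
    # Placeholder for a real judge-model call; the jitter simulates
    # the sampling noise you'd see from an actual LLM judge.
    return 7.0 + random.uniform(-1.5, 1.5)

def grade_with_variance(answer: str, trials: int = 5):
    scores = [grade(answer) for _ in range(trials)]
    return mean(scores), stdev(scores)

m, s = grade_with_variance("some model output", trials=5)
print(f"mean={m:.2f} stdev={s:.2f}")
```

If the stdev is comparable to the score gaps between adjacent models on the leaderboard, the ordering isn't meaningful.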


Why no multi-turn evaluation? A lot of these benchmarks fail to capture the strength of ghost attention used in Llama 2 chat models.
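A multi-turn harness isn't much more code than a single-shot one: resend the full message history each turn and have the judge check whether a turn-1 instruction is still being honored later on. A rough sketch, where chat() is a hypothetical stand-in for a real chat-completion API call:

```python
# Multi-turn evaluation loop: the growing message history is resent
# every turn, which is exactly the setting ghost attention (GAtt) in
# Llama 2 chat is meant to improve (keeping an early system
# instruction in effect across later turns).
from typing import List, Dict

def chat(messages: List[Dict[str, str]]) -> str:
    # Placeholder: a real implementation would call the model API here.
    return f"(reply to: {messages[-1]['content']!r})"

def run_multi_turn(system_instruction: str, user_turns: List[str]) -> List[str]:
    messages = [{"role": "system", "content": system_instruction}]
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = chat(messages)  # history grows every turn
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

# The judge would then score each reply against the turn-1 instruction.
print(run_multi_turn("Always answer in French.", ["Hi!", "What's 2+2?"]))
```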


Any reason why PaLM or Cohere models are not here?


PaLM 2 is tied for #10.


GPT-4-0314 is top of the league table (i.e. not the latest version, but the version released in March).

Is this our Concorde moment?


Really cool, thanks.



