
Benchmark charts on model card: https://huggingface.co/01-ai/Yi-1.5-34B-Chat#benchmarks

Yi-1.5-34B with results similar to Llama 3 70B and Mixtral 8x22B

Yi-1.5-6B and 9B with results similar to Llama 3 8B
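
For anyone who wants to poke at it locally, here's a minimal sketch using Hugging Face transformers (assumptions: the standard chat template from the model card, bf16 weights, and enough VRAM for the 34B; the 6B/9B chat variants swap in the same way):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "01-ai/Yi-1.5-34B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 fits your hardware
        device_map="auto",
    )

    messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=256)
    # strip the prompt tokens and print only the generated reply
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))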



We'll need to wait for the LMSYS Chatbot Arena to see how the model actually performs.


I had good results with the previous Yi-34B and its fine-tunes like Nous-Capybara-34B. It will be interesting to see what Chatbot Arena thinks, but my expectations are high.

https://huggingface.co/NousResearch/Nous-Capybara-34B


No, LMSYS is just another very obviously flawed benchmark.


Flawed in some ways but still fairly hard to game and useful.


Please elaborate on this: how is it flawed?


It's close to useless for most use cases: half of it is people probing with riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so a side-by-side view won't get the best result out of either one.

The "will it tell me how to make meth" stuff is a huge source of noise, which you could argue is digging for refusals which can be annoying, and the benchmark claims to filter out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".

Honestly, the fact that any closed-source model with guardrails can even place is a miracle; in a proper benchmark, the honest-to-goodness gap between most closed-source models and open-source models would be so large it would break most graphs.


This is so nonsensical it's hilarious: "corporate" models have always been at the top of the leaderboard.


Maybe it's just a more nuanced comment than you're used to. "Corporate" models are interspersed in a way that doesn't reflect their real-world performance.

There aren't nearly as many GPT-3.5-level models as the leaderboard implies, for example.


Pretraining on the test set is all you need.

LLM benchmarks are horribly broken. IMHO there is better signal in just looking at parameter counts.


Looking at the number of tokens a model was trained on is also a really good indicator of world understanding. Llama 3 is a game changer for some use cases because there's finally a model that understands the world deeply, as opposed to typical models that can be fine-tuned for hyper-specific tasks but generalize poorly, especially in D2C use cases where someone might probe the model's knowledge.



