
Benchmark charts on model card: https://huggingface.co/01-ai/Yi-1.5-34B-Chat#benchmarks

Yi-1.5-34B with results similar to Llama 3 70B and Mixtral 8x22B

Yi-1.5-6B and 9B with results similar to Llama 3 8B
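
For anyone who wants to poke at it locally, here's a minimal sketch using Hugging Face transformers (assumptions: the standard chat template from the model card, bf16 weights, and enough VRAM for the 34B; the 6B/9B chat variants swap in the same way):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "01-ai/Yi-1.5-34B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 fits your hardware
        device_map="auto",
    )

    messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=256)
    # strip the prompt tokens and print only the generated reply
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))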



We'll need to wait for the LMSYS Chatbot Arena to see how the model actually performs.


I had good results with the previous Yi-34B and its fine-tunes like Nous-Capybara-34B. It will be interesting to see what Chatbot Arena thinks, but my expectations are high.

https://huggingface.co/NousResearch/Nous-Capybara-34B


No, LMSYS is just another very obviously flawed benchmark.


Flawed in some ways but still fairly hard to game and useful.


Please elaborate on this: how is it flawed?


It's close to useless for most use cases: half of it is people probing with riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so a side-by-side view won't get the best result out of either one.

The "will it tell me how to make meth" stuff is a huge source of noise, which you could argue is digging for refusals which can be annoying, and the benchmark claims to filter out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".

Honestly, the fact that any closed-source model with guardrails can even place is a miracle; in a proper benchmark, the honest-to-goodness gap between most closed-source models and open-source models would be so large it would break most graphs.


This is so nonsensical it's hilarious: "corporate" models have always been at the top of the leaderboard.


Maybe it's just a more nuanced comment than you're used to. "Corporate" models are interspersed in a way that doesn't reflect their real-world performance.

There aren't nearly as many GPT-3.5-level models as the leaderboard implies, for example.


Pretraining on the test set is all you need.

LLM benchmarks are horribly broken. IMHO there is better signal in just looking at parameter counts.


Looking at the number of tokens a model was trained on is also a really good indicator of world understanding. Llama 3 is a game changer for some use cases because there's finally a model that understands the world deeply, as opposed to typical models that can be fine-tuned for hyper-specific tasks but generalize poorly, especially in D2C use cases where someone might probe the model's knowledge.



