Ask HN: Where can I find a real-time comparison of LLM chatbot performance?
15 points by astrobotanical on Feb 10, 2024 | 6 comments
I'd like to know which chatbot I should leverage for a particular task, as I assume different tools are better suited for different applications.

I've seen formal studies that have examined different dimensions of LLM chatbot performance (e.g. informational or linguistic quality, logical reasoning, creativity), and many anecdotal reports by the HN commentariat. I assume these analyses become outdated quickly, considering the rate at which the tools are evolving.

Are there entities that are evaluating LLMs and publishing the results as quickly as the models change?




LMSYS’ Chatbot Arena (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...) is widely regarded as one of, if not the, most reliable open benchmarks for LLMs. Real users provide prompts to chatbots and then blindly pick the best response. The only drawback is that the leaderboard is restricted to the most popular models and, even then, it can take a while for new models to be added. This is understandable given the considerable ongoing costs associated with continuously updating the leaderboard.

There is also the Open LLM Leaderboard by HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which aggregates a number of benchmarks, some (e.g. MMLU) more trustworthy than others (e.g. TruthfulQA). There are real concerns, however, that ML practitioners are gaming the leaderboard by contaminating their training data with evaluation data.

There are a number of other leaderboards, such as OpenCompass (https://rank.opencompass.org.cn/leaderboard-llm-v2) and Yet Another LLM Leaderboard (https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leade...), that I have seen suggested, although I personally have found the most success with Chatbot Arena and the Open LLM Leaderboard.

I would also suggest checking out the LLM Explorer (https://llm.extractum.io/), which has all of these benchmarks and more in a single location and allows you to sort and filter by a wide range of variables. That has been particularly helpful for me when trying to find models that will fit on my GPUs.

N.B. I am not affiliated with any of the benchmarks and services mentioned above.


https://chat.lmsys.org/ is the go-to for general comparison (click the Leaderboard tab).

As for task-specific comparisons, you'll probably have to dig into papers. Typically, if you have something non-generic, you'll want to fine-tune.


Start with LLM Explorer. https://llm.extractum.io


Wouldn't it be simpler and more accurate for you to just try X variations of your personal prompt(s) across Y services over Z runs? There's so much variance that no single study would capture every use case.
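Concretely, a minimal sketch of that loop might look like the following, assuming each service exposes an OpenAI-compatible chat completions endpoint; the base URLs, model names, and API key here are placeholders, not real services:

    # Run each prompt variation against each service N times and collect the outputs.
    # Assumes OpenAI-compatible endpoints; URLs, model names, and keys are placeholders.
    from openai import OpenAI

    services = {
        "vendor_a": {"base_url": "https://api.vendor-a.example/v1", "model": "model-a"},
        "vendor_b": {"base_url": "https://api.vendor-b.example/v1", "model": "model-b"},
    }
    prompts = [
        "Summarize this contract clause: ...",
        "Summarize this contract clause in bullet points: ...",
    ]
    runs = 3

    results = []
    for name, cfg in services.items():
        client = OpenAI(base_url=cfg["base_url"], api_key="YOUR_KEY_HERE")
        for prompt in prompts:
            for i in range(runs):
                resp = client.chat.completions.create(
                    model=cfg["model"],
                    messages=[{"role": "user", "content": prompt}],
                )
                results.append((name, prompt, i, resp.choices[0].message.content))

    # Review (ideally blind-label) the collected responses afterwards.
    for name, prompt, i, text in results:
        print(f"[{name} | run {i}] {prompt[:40]!r} -> {text[:80]!r}")

Multiple runs per prompt matter because single responses vary a lot; the point is to judge the distribution, not one lucky output.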


Different LLMs require different prompting strategies; the same prompt rarely works well across vendors because of differences in training.


Try them yourself.

Ask the same question in several chatbots, and disqualify the bad ones.



