I'd like to know which chatbot I should leverage for a particular task, as I assume different tools are better suited for different applications.
I've seen formal studies that have examined different dimensions of LLM chatbot performance (e.g. informational or linguistic quality, logical reasoning, creativity), and many anecdotal reports by the HN commentariat. I assume these analyses become outdated quickly, considering the rate at which the tools are evolving.
Are there entities that are evaluating LLMs and publishing the results on a comparable cadence?
There is also the Open LLM Leaderboard by HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which aggregates a number of benchmarks, some (e.g., MMLU) more trustworthy than others (e.g., TruthfulQA). There are real concerns, however, that ML practitioners are gaming the leaderboard by contaminating their training data with evaluation data.
There are a number of other leaderboards, such as OpenCompass (https://rank.opencompass.org.cn/leaderboard-llm-v2) and Yet Another LLM Leaderboard (https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leade...), that I have seen suggested, although I personally have found the most success with Chatbot Arena and the Open LLM Leaderboard.
I would also suggest checking out the LLM Explorer (https://llm.extractum.io/), which has all of these benchmarks and more in a single location and allows you to sort and filter by a wide range of variables. That has been particularly helpful for me when trying to find models that will fit on my GPUs.
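For a first-pass filter before consulting any leaderboard, I use a back-of-the-envelope estimate (my own rule of thumb, not something these sites compute): weight memory is roughly parameter count times bytes per parameter, plus some overhead for activations and KV cache. A minimal sketch, where the ~20% overhead factor is an assumption that varies with context length:

```python
# Rough VRAM estimate for running a model locally.
# The 1.2x overhead factor is an assumption (activations + KV cache
# at modest context lengths); real usage varies with context size.

BYTES_PER_PARAM = {
    "fp16": 2.0,
    "int8": 1.0,
    "q4": 0.5,   # 4-bit quantization
}

def est_vram_gib(params_billions: float, dtype: str = "fp16",
                 overhead: float = 1.2) -> float:
    """Approximate GiB needed to load and run a model of the given size."""
    weight_bytes = params_billions * 1e9 * BYTES_PER_PARAM[dtype]
    return weight_bytes * overhead / 2**30

# A 7B model in 4-bit quantization comes out around 3.9 GiB,
# while the same model in fp16 needs roughly 15-16 GiB.
```

This is only a screening heuristic; actual requirements depend on the runtime, context window, and batch size, so I still check the reported figures on LLM Explorer before downloading anything.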
N.B. I am not affiliated with any of the benchmarks and services mentioned above.