One solution is to come up with a new benchmark yourself.
Manually benchmarking a pair of models by coming up with ~20 questions, feeding them to both, and blindly choosing the better answer for each can give you a pretty good picture.
And that can probably be done in under 20 mins of human time.
Especially since most devs have a specific use case in mind: tests tailored to your own needs will always be more informative than off-the-shelf metrics.
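A minimal sketch of that kind of blind A/B harness, assuming an OpenAI-compatible API; the model identifiers and question list are placeholders you'd swap for your own:

```python
# Blind A/B harness: send each question to two models, shuffle the answers,
# and pick the better one without knowing which model wrote it.
# Assumes the openai>=1.0 Python client; model names/questions are hypothetical.
import random
from openai import OpenAI

client = OpenAI()
MODELS = ("model-a", "model-b")   # hypothetical model identifiers
QUESTIONS = [
    "Summarize the tradeoffs of using SQLite in production.",
    "Write a regex that matches ISO 8601 dates.",
    # ... fill in ~20 questions from your own use case
]

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

wins = {m: 0 for m in MODELS}
for q in QUESTIONS:
    answers = [(m, ask(m, q)) for m in MODELS]
    random.shuffle(answers)                 # hide which model wrote which answer
    print(f"\nQ: {q}")
    for i, (_, text) in enumerate(answers):
        print(f"\n--- Answer {i + 1} ---\n{text}")
    choice = int(input("Better answer (1 or 2)? ")) - 1
    wins[answers[choice][0]] += 1

print(wins)
```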
You can ask GPT-4 or another strong model to rate two chat logs for coherency, etc. It's not as accurate as human evaluation, but you don't have to read thousands of lines of text when comparing many models.
This is problematic if one of the models being compared is in the same base family as the evaluator: the judge will probably favor its relative, because that output is literally the kind of sequence it would naturally emit itself.
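A rough sketch of the LLM-as-judge idea under the same assumptions (OpenAI-compatible client, hypothetical judge model name); the prompt and parsing are deliberately simplistic, and to reduce the self-preference issue above you'd pick a judge from a different family than the contestants and swap the A/B order across runs:

```python
# LLM-as-judge sketch: a strong model picks the more coherent of two transcripts.
# Judge model name and prompt wording are assumptions, not a fixed recipe.
from openai import OpenAI

client = OpenAI()
JUDGE = "gpt-4o"  # placeholder; ideally a different family than the models under test

def judge_pair(log_a: str, log_b: str) -> str:
    prompt = (
        "You are comparing two chat transcripts for coherence and helpfulness.\n"
        "Reply with exactly 'A' or 'B' for the better transcript.\n\n"
        f"Transcript A:\n{log_a}\n\nTranscript B:\n{log_b}"
    )
    resp = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Usage: run twice with the order swapped; if the verdicts disagree,
# that hints at positional bias rather than a real quality gap.
# verdict = judge_pair(open("model_a.txt").read(), open("model_b.txt").read())
```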