We run tons of evals with public datasets from LLM Arena across common categories such as law, finance, code, maths, etc., pair them with public benchmarks such as Natural2Code and GPQA, and then tag the benchmarks with the relevant cost and speed metrics at a provider level. One of the weights in the Prompt Engine model is the similarity of outputs when we execute a user prompt on 5-6 different models: higher-quality outputs tend to be similar across at least 3 of the models.
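As a rough illustration of that agreement check, here is a minimal sketch in Python. It is not our actual implementation: a plain string-similarity ratio stands in for whatever scoring metric is really used, and the function name `outputs_agree`, the `threshold`, and the commented-out model calls are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def outputs_agree(outputs, threshold=0.7, min_agree=3):
    """Check whether at least `min_agree` model outputs are mutually similar.

    `outputs` is a list of response strings, one per model (5-6 in practice).
    SequenceMatcher is only a stand-in similarity metric for illustration.
    """
    # Each output trivially "agrees" with itself.
    agree_counts = [1] * len(outputs)
    # Count, for every pair of outputs, whether they are similar enough.
    for i, j in combinations(range(len(outputs)), 2):
        if SequenceMatcher(None, outputs[i], outputs[j]).ratio() >= threshold:
            agree_counts[i] += 1
            agree_counts[j] += 1
    return max(agree_counts) >= min_agree

# Hypothetical usage: run the same user prompt through several models,
# then check whether at least 3 of them produced similar answers.
# responses = [model.generate(user_prompt) for model in candidate_models]
# if outputs_agree(responses):
#     ...  # weight this result higher in the routing score
```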