Sure! STAC Research has been building and running benchmarks in finance for ~18 years, and we've had to solve many of the same problems you're highlighting here: tech and model providers tuning specifically for the benchmark, results that get published but can't be reproduced outside the provider's lab, and so on.
The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks; in finance, for example, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of the hardware, the software, or the model. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.
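To make the controlled-variable idea concrete, here's a minimal, hypothetical sketch (not STAC code): the workload and all but one layer of the stack are held constant, only the model layer varies, and the full config is recorded alongside the results so a run can be disclosed and reproduced elsewhere. All names and numbers are illustrative assumptions.

```python
# Hypothetical illustration of a controlled comparison: same workload,
# same hardware and software, only the model layer changes.
import json
import statistics
import time

def fixed_workload(stack):
    """Stand-in for a user-defined task (e.g. serving a market snapshot).
    Here it just sleeps for a stack-dependent amount to simulate latency."""
    time.sleep(stack["simulated_latency_s"])

def run_benchmark(stack, iterations=20):
    """Run the same workload repeatedly; record latencies plus the full
    config so the run can be disclosed and reproduced."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fixed_workload(stack)
        latencies.append(time.perf_counter() - start)
    return {
        "config": stack,  # full disclosure of everything that varied
        "median_latency_s": statistics.median(latencies),
        "p99_latency_s": sorted(latencies)[int(0.99 * len(latencies))],
    }

if __name__ == "__main__":
    # Hold hardware and software constant; vary only the model under test.
    baseline = {"hardware": "node-A", "software": "runtime-1.0"}
    candidates = [
        {**baseline, "model": "model-x", "simulated_latency_s": 0.010},
        {**baseline, "model": "model-y", "simulated_latency_s": 0.012},
    ]
    for result in map(run_benchmark, candidates):
        print(json.dumps(result, indent=2))
```

In the real thing the workload, metrics, and audit process are obviously far more involved, but the principle is the same: change one layer at a time and publish enough detail that someone else can rerun it.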
In finance, the benchmarks are trusted enough to drive major infrastructure decisions at leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.
We're now starting to apply the same principles to the AI benchmarking space, and would love to talk to anyone who wants to be involved.