We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something where we can at least get some peer review.
I’m working on this at STAC Research and looking to connect with others interested in helping. Key challenges are ensuring impartiality (and keeping it that way), making benchmarks ungameable, and guaranteeing reproducibility. We’ve done similar work in finance and are now applying the same principles to AI.
Sure! STAC Research has been building and running benchmarks in finance for ~18 years. We’ve had to solve many of the same problems I think you’re highlighting here, e.g. tech & model providers tuning specifically for the benchmark, results that get published but can’t be reproduced outside the provider’s lab, etc.
The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks. E.g. in finance, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of hardware, software, or models. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.
In finance, the benchmarks are trusted enough to drive major infrastructure decisions at leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.
We’re now starting to apply the same principles to AI benchmarking. Would love to talk to anyone who wants to be involved.