
We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something where we can at least get some peer review.




I’m working on this at STAC Research and looking to connect with others interested in helping. Key challenges are ensuring impartiality (and keeping it that way), making benchmarks ungameable, and guaranteeing reproducibility. We’ve done similar work in finance and are now applying the same principles to AI.

That sounds amazing, mind telling us a little more?

Sure! STAC Research has been building and running benchmarks in finance for ~18 years. We’ve had to solve many of the same problems I think you’re highlighting here, e.g. tech & model providers tuning specifically for the benchmark, results that get published but can’t be reproduced outside the provider’s lab, etc.

The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks. E.g. in finance, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of hardware, software, or models. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.
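
To make that concrete, here’s a rough Python sketch of what "holding layers constant" could look like: fix the hardware and runtime layers, vary only the model, and record the full config alongside every result. This is illustrative only, not our actual harness; the workload, config names, and helper functions are all made up.

    # Minimal sketch of controlled-variable benchmarking (hypothetical, not STAC tooling).
    import json, time, statistics, hashlib

    def run_benchmark(workload, stack_config, trials=5):
        """Run a user-defined workload against one stack; return latency stats."""
        latencies = []
        for _ in range(trials):
            start = time.perf_counter()
            workload(stack_config)            # the task users actually care about
            latencies.append(time.perf_counter() - start)
        return {
            "config": stack_config,           # full disclosure of config/tuning
            "config_hash": hashlib.sha256(
                json.dumps(stack_config, sort_keys=True).encode()).hexdigest(),
            "median_s": statistics.median(latencies),
            # rough p99 for small samples
            "p99_s": sorted(latencies)[int(0.99 * (len(latencies) - 1))],
        }

    # Hold hardware and software constant; vary only the model layer.
    base = {"hardware": "gpu-node-a", "runtime": "runtime-1.2"}
    results = [run_benchmark(lambda cfg: time.sleep(0.01),   # placeholder workload
                             {**base, "model": m})
               for m in ("model-x", "model-y")]
    print(json.dumps(results, indent=2))

The independent audit then checks that only the intended layer actually changed between runs and that the published numbers match what the recorded configs produce.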

In finance, the benchmarks are trusted enough to drive major infrastructure decisions by the leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.

We’re now starting to apply the same principles to the AI benchmarking space. Would love to talk to anyone who wants to be involved.


Thank you, it’s quite brilliant to transfer those skills like this.

So the business model would be AI foundries contracting you to evaluate their models?

Do you envision some kind of freely accessible platform for consulting the results?


That sounds like an interesting idea to me. It would at least resolve the problem of companies gaming the metric.

Another option is the LiveBench approach, where new tests are released on a regular basis.




