Leaderboards are getting harder and harder to use as a decision tool. What does it mean to be 0.7% or 1.6% better? How does that help me? Is higher always better? What are the trade-offs? Evals continue to be the hardest and most important part of LLMs and the tools that use them.
Totally agreed. Leaderboards are only good as inputs to a stack decision, not the deciding factor. WebVoyager itself feels like it's approaching saturation though, as scores are getting high and it only tests a narrow domain of use-cases. We'll definitely see more challenging and interesting evals pop up in the next little bit.
IMHO, if you're building AI products, most of the time building and running your own evals is the only reliable way to build something good.
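To make that concrete, here's roughly the shape I mean: a handful of tasks from your own product domain, each paired with a pass/fail check you actually trust. This is a minimal hypothetical sketch; run_agent stands in for whatever model or agent you're testing, and the cases are made up.

    # Minimal sketch of a homegrown eval harness.
    # All names here are hypothetical; swap run_agent for your real agent call.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalCase:
        task: str                         # instruction given to the agent
        check: Callable[[str], bool]      # domain-specific pass/fail judgment

    def run_agent(task: str) -> str:
        # Stand-in for the real thing (API request, browser session, etc.).
        return "42" if "answer" in task else ""

    CASES = [
        EvalCase("What is the answer to life, the universe, and everything?",
                 lambda out: "42" in out),
        EvalCase("Summarize our refund policy.",
                 lambda out: "refund" in out.lower()),
    ]

    def run_evals(cases: list[EvalCase]) -> float:
        passed = 0
        for case in cases:
            output = run_agent(case.task)
            ok = case.check(output)
            passed += ok
            print(f"{'PASS' if ok else 'FAIL'}: {case.task[:50]}")
        return passed / len(cases)

    if __name__ == "__main__":
        print(f"pass rate: {run_evals(CASES):.0%}")

The point isn't the harness itself, it's that the tasks and checks come from your product, so a pass rate actually tells you something a generic leaderboard delta can't.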
BTW - Arch looks super cool! Just starred and looking forward to playing around with it :)
Hey all! Wanted to share this leaderboard we put together to centralize results for the different models available in the AI browser agent space.
While working on Steel, we've seen a ton of people struggle to put the browser agent space, and how it's progressing, into perspective. It felt odd to us that there was no centralized leaderboard like there is for so many other agentic use cases.
So we launched this leaderboard to help! It's open-source, and we're open to contributions for anything we may be missing. We're committed to keeping it up to date as the space progresses (which it seems to be doing quite quickly).