Show HN: Skyvern 2.0 – open-source AI Browser Agent scoring 85.8% on WebVoyager (skyvern.com)
9 points by suchintan on Jan 16, 2025 | 3 comments
Hey HN,

We’re Suchintan and Shu from Skyvern (https://www.skyvern.com). We’re building an open source AI Agent that can browse the web and take actions. Our open source repo can be found at https://github.com/Skyvern-AI/Skyvern.

We’ve re-built Skyvern with a Planner-Actor-Validator agent architecture and achieved a state-of-the-art (SOTA) score of 85.8% on the WebVoyager benchmark. You can see the results for yourself here: https://eval.skyvern.com/

For reference, here were the previous SOTA results:
83.5% - Google Mariner (https://deepmind.google/technologies/project-mariner/)
73.1% - AgentE (https://arxiv.org/html/2407.13032v1)
67.0% - HCompany (https://www.hcompany.ai/blog/a-research-update)
59.1% - WebVoyager (https://arxiv.org/html/2401.13919v4)
52.6% - WILBUR (https://arxiv.org/html/2404.05902v1)
52.0% - Claude Computer Use (https://docs.anthropic.com/en/docs/build-with-claude/compute...)

Achieving this SOTA result required expanding Skyvern’s original architecture. Skyvern 1.0 used a single prompt running in a loop, both making decisions and taking actions on a website. This approach was a good starting point, but it scored only ~45% on the WebVoyager benchmark because it had insufficient memory of previous actions and could not do complex reasoning.

We re-built all of this using a Planner-Actor-Validator agent architecture:
1. Planner - Decides which goals to accomplish on a website, and maintains a working memory of the overall goal and progress towards it
2. Actor - Given a narrowly scoped goal, executes the goal on the website and reports back
3. Validator - Asserts whether the goal was successfully achieved and passes feedback back to the Actor + Planner
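
A rough sketch of how the three roles could fit together in Python. It is illustrative only and not taken from the Skyvern codebase: `run_agent`, `Memory`, `Verdict`, and the stub callbacks in `demo` are hypothetical names, and in a real agent each callback would presumably wrap an LLM call and a browser session rather than a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Validator output: did the sub-goal succeed, and why or why not."""
    success: bool
    feedback: str


@dataclass
class Memory:
    """Planner's working memory: the overall goal plus progress so far."""
    overall_goal: str
    completed: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def run_agent(
    overall_goal: str,
    plan: Callable[[Memory], Optional[str]],
    act: Callable[[str, Optional[str]], str],
    validate: Callable[[str, str], Verdict],
    max_steps: int = 10,
) -> Memory:
    memory = Memory(overall_goal)
    last_feedback: Optional[str] = None
    for _ in range(max_steps):
        sub_goal = plan(memory)                 # Planner picks the next narrow goal
        if sub_goal is None:                    # nothing left to do
            break
        report = act(sub_goal, last_feedback)   # Actor executes and reports back
        verdict = validate(sub_goal, report)    # Validator checks the outcome
        if verdict.success:
            memory.completed.append(sub_goal)
            last_feedback = None
        else:
            # Feedback goes to the Planner (via memory) and to the Actor (next call).
            memory.notes.append(verdict.feedback)
            last_feedback = verdict.feedback
    return memory


def demo() -> Memory:
    """Stubbed run: the second sub-goal fails once, then succeeds on retry."""
    steps = ["open search page", "enter query", "read result"]
    attempts = {}

    def plan(memory: Memory) -> Optional[str]:
        remaining = [s for s in steps if s not in memory.completed]
        return remaining[0] if remaining else None

    def act(sub_goal: str, feedback: Optional[str]) -> str:
        attempts[sub_goal] = attempts.get(sub_goal, 0) + 1
        return f"attempt {attempts[sub_goal]}: {sub_goal}"

    def validate(sub_goal: str, report: str) -> Verdict:
        if sub_goal == "enter query" and attempts[sub_goal] == 1:
            return Verdict(False, "query box was empty after typing; retry")
        return Verdict(True, "ok")

    return run_agent("find an answer", plan, act, validate)
```

Calling `demo()` walks through all three stub sub-goals and records the one failed attempt in `notes`, which is the kind of corrective signal that a single prompt running in a loop does not get.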

We ran the benchmark on Skyvern cloud to test Skyvern 2.0 in a real-world environment – autonomously navigating the web in a remotely hosted browser without any human involvement.

In keeping with our open source mission, we decided to publish the benchmark dataset, our modifications, and the final results for anyone to review. This is important because we’re seeing an increasing trend of companies publishing benchmark scores with no way to access the underlying results.

[1] Eval Dataset: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/d...
[2] Modifications: https://github.com/Skyvern-AI/skyvern/pull/1576/commits/60dc...
[3] Each run (incl prompts + responses) can be inspected here: https://eval.skyvern.com/

The full report (incl an architecture diagram) can be found here: https://blog.skyvern.com/skyvern-2-0-state-of-the-art-web-na...

If you’d like to give Skyvern a try, you can grab the open source version (https://github.com/Skyvern-AI/Skyvern) or the cloud version (https://app.skyvern.com/) and share any feedback with us. We look forward to any and all of your comments!



Congrats on the launch, and thanks for open-sourcing it! We recently gave it a go to browse third-party API docs and were surprised by the accuracy of the summaries! I'm curious, why do you think the actor model works better for LLMs?


It creates a feedback loop for the LLM to help it correct hallucinations or misunderstandings about how its planned actions actually played out!


Very interesting. How are the evaluations done? I.e., how do you categorize a task as a failure or a success?



