For that reason we wanted to give you more control as well, with a for loop where you take actions step by step. I think all of these CrewAI-like agent swarms are also very much black boxes.
How would you imagine the perfect scenario? What would make LLM outputs less of a black box?
Nope:
1280x1024 at low resolution with gpt-4o is 85 tokens, so approx $0.0002 (so ~100x cheaper). For high resolution it's approx $0.002.
https://openai.com/api/pricing/
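Rough math behind those numbers, assuming $2.50 per 1M input tokens and OpenAI's tile-based image token formula (both are my assumptions; check the pricing page for current values):

```python
import math

PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed gpt-4o input price

def image_tokens(width: int, height: int, detail: str = "low") -> int:
    """Estimate image input tokens for gpt-4o."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # High detail: fit within 2048x2048, scale the shortest side down to 768,
    # then count 512x512 tiles at 170 tokens each plus an 85-token base.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

for detail in ("low", "high"):
    tokens = image_tokens(1280, 1024, detail)
    print(detail, tokens, f"${tokens * PRICE_PER_TOKEN:.4f}")
# low 85 $0.0002
# high 765 $0.0019
```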
(handwaving) I'd rather be in a loop of "here's our goal. here's latest output from the CLI. what do we type into the CLI" than the GUI version of that loop.
Hmm, but isn't this how we handle it? We just have a CLI that outputs exactly the goal and state, and asks the user for more clarity if needed, no GUI.
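For concreteness, the loop I'm picturing is something like this (a minimal sketch; ask_model is a stand-in for the actual LLM call and the DONE convention is made up):

```python
import subprocess

def ask_model(goal: str, last_output: str) -> str:
    """Placeholder: given the goal and the latest CLI output,
    return the next shell command, or DONE when finished."""
    raise NotImplementedError

def run_loop(goal: str, max_steps: int = 10) -> None:
    last_output = "(no output yet)"
    for _ in range(max_steps):  # an explicit for loop keeps each step inspectable
        command = ask_model(goal, last_output)
        if command.strip() == "DONE":
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        last_output = result.stdout + result.stderr
        print(f"$ {command}\n{last_output}")
```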
The original idea was to make it completely headless.
I'm sorry, I'm definitely off today and am missing it. I appreciate your patience.
I'm thinking maybe the goal/state stuff might have clouded my point. Setting aside prompt engineering, I'm just thinking of the stock AI UIs today, i.e. chat-based.
Then, we want to accomplish some goal using GUI and/or CLI. Given the premise that I'd avoid GUI automation, why am I saying CLI is the way to go?
A toy example: let's say the user says "get my current IP".
If our agent is GUI-based, maybe it does: open Chrome > type in whatismyip.com > recognize IP from screenshot.
If our agent is CLI-based, maybe it does: run the curl command to fetch the user's IP from a public API (e.g. curl whatismyip.com) > parse the output to extract the IP address > return the IP address to the user as text.
In the CLI example, the agent interacts with the system using native commands (in this case, curl) and text outputs, rather than trying to simulate GUI actions and parse screenshot contents.
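Concretely, the CLI path is roughly this (a sketch; I've swapped in ifconfig.me because it returns a plain-text IP, unlike whatismyip.com):

```python
import re
import subprocess

def get_current_ip() -> str:
    """Run curl, parse the text output, return the IP as text."""
    result = subprocess.run(
        ["curl", "-s", "ifconfig.me"], capture_output=True, text=True, timeout=10
    )
    match = re.search(r"\d{1,3}(?:\.\d{1,3}){3}", result.stdout)
    if not match:
        raise RuntimeError(f"Could not find an IP in: {result.stdout!r}")
    return match.group(0)

print(get_current_ip())
```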
Why do I believe that's preferable to GUI-based automation?
1. More direct/efficient - no need for browser launching, screenshot processing, etc.
2. More reliable - dealing with only structured text output, rather than trying to parse visual elements
3. Parallelizable: I can have N CLI shells, but only 1 GUI shell, which is shared with the user.
4. In practice, I'm basing that on observations of the GUI-automation project I mentioned, accepting that computer automation is desirable, and...work I did to build an end-to-end testing framework for devices paired to phones, both iOS and Android.
What the? Where did that come from?
TL;DR: I've loved E2E tests for years, and it was stultifying to see how little they were used beyond the testing team due to flakiness. Even small things like "Launch the browser" are extremely fraught. How long do we wait? How often do we poll? How do we deal with some dialog appearing in front of the app? How do we deal with not having the textual view hierarchy for the entire OS?
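Every E2E framework I've touched ends up with a helper shaped like this, and the two magic numbers are exactly where the flakiness lives (a generic sketch, not from any particular framework):

```python
import time

def wait_for(condition, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.
    Too short a timeout gives false failures; too long and the suite crawls."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# e.g. wait_for(lambda: browser_window_is_visible(), timeout=60, interval=1.0)
# where browser_window_is_visible is whatever hypothetical check your framework exposes
```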
I don't know a lot about this, but do you have the full power of Selenium or not? That would also be a very interesting approach, especially when “local” browser models get very good.
From 3 days of playing around with it, I couldn't find a way to use Selenium or Playwright in the browser.
What I did, though, is have a loop that sends instructions from Playwright.
For instance, I'll open the browser, then enter a loop that awaits instructions (these can come from an event source such as Redis) to execute in the same browser. But still, it's based on the session instantiated by Playwright.
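Roughly, the loop looks like this (a sketch with Python's Playwright bindings and redis-py; the channel name and URL-as-instruction format are just for illustration):

```python
import redis
from playwright.sync_api import sync_playwright

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe("browser-instructions")  # made-up channel name

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()  # one long-lived session, reused for every instruction
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        instruction = message["data"].decode()  # here each instruction is just a URL
        if instruction == "quit":
            break
        page.goto(instruction)  # executed in the same browser session Playwright created
    browser.close()
```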
Thanks man, starred yours too, it's super cool to see all these projects getting spun up!
I see Cerebellum is vision-only. Did you try adding HTML + screenshots? I think that improves the performance like crazy, and you don't have to use only Claude.
Just saw Skyvern today on previous Show HNs haha :)
I had an older version that used simplified HTML, and it got to decent performance with GPT-4o and Gemini, but at the cost of 10x token usage. You are right, identifying the interactable elements and pulling their values out into a prompt structure to explicitly allow the next actions can boost performance, especially if done with a grammar, like structured outputs or guidance-llm. However, I saw that Claude had similar levels of performance with pure vision, and I felt that vision + more training would beat a specialized DOM algorithm due to "the bitter lesson".
BTW I really like your handling of browser tabs, I think it's really clever.
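For reference, "pulling interactable elements and their values into a prompt structure" can look roughly like this (a BeautifulSoup sketch of the idea, not the actual code from that older version):

```python
from bs4 import BeautifulSoup

INTERACTABLE = ["a", "button", "input", "select", "textarea"]

def extract_interactables(html: str) -> list[str]:
    """Turn interactable elements and their key attributes into short,
    numbered lines that can be dropped into a prompt."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for i, el in enumerate(soup.find_all(INTERACTABLE)):
        text = el.get_text(strip=True)[:50]
        attrs = {k: el.get(k) for k in ("id", "name", "type", "href", "value") if el.get(k)}
        lines.append(f"[{i}] <{el.name}> {text!r} {attrs}")
    return lines

html = '<button id="submit">Sign in</button><input name="q" type="text">'
print("\n".join(extract_interactables(html)))
# [0] <button> 'Sign in' {'id': 'submit'}
# [1] <input> '' {'name': 'q', 'type': 'text'}
```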
Looks nice. I find the HTML cleaning step in our pipeline extremely important; otherwise there's no real benefit over just using a general vision model and clicking coordinates (and the whole HTML is just way too many tokens). How do you guys handle that?
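The cleaning pass I have in mind is roughly along these lines (a BeautifulSoup sketch; the tag list and thresholds are arbitrary, not our exact pipeline):

```python
from bs4 import BeautifulSoup, Comment

def clean_html(html: str, max_attr_len: int = 80) -> str:
    """Strip the parts of the page the model never needs (scripts, styles,
    comments, bulky attributes) so the remaining markup costs far fewer tokens."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    for tag in soup.find_all(True):
        for attr, value in list(tag.attrs.items()):
            if isinstance(value, list):
                value = " ".join(value)
            if len(str(value)) > max_attr_len:
                del tag.attrs[attr]  # drop bulky attributes like inline styles or data blobs
    return str(soup)
```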