gregpr07's comments | Hacker News

Creator of Browser Use here. This looks super cool! How do you spin them up so fast? Firecracker?


We keep "prewarmed" browsers ready for new sessions. Looking into Firecracker though!
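Roughly the idea, as an illustrative sketch only (not our actual implementation; Playwright and the pool size are stand-ins):

    # Illustrative sketch of a prewarmed pool: browsers are launched ahead
    # of time, so a new session takes a warm one off the queue instead of
    # paying the cold-start cost.
    import queue
    from playwright.sync_api import sync_playwright

    POOL_SIZE = 4  # arbitrary example value

    p = sync_playwright().start()
    pool = queue.Queue()
    for _ in range(POOL_SIZE):
        pool.put(p.chromium.launch(headless=True))  # prewarm

    def acquire_browser():
        browser = pool.get()                        # hand out a warm browser
        pool.put(p.chromium.launch(headless=True))  # top the pool back up
        return browser                              # (a real pool would refill in the background)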


Nice!! We are working on a higher-level library over at https://github.com/gregpr07/browser-use.


Looks awesome! DM'd :)


Technically you can, yeah. I'm not sure if the performance is the same, though; I'd have to test it!


Who (or still in stealth haha)?


I'll leave it to them to announce once they're ready, but it's certainly a pretty forward-looking play.


For that, we wanted to give you more control as well: you can drive the agent with a for loop and take actions step by step (see the sketch below). I think all of these CrewAI-like agent swarms are also very much black boxes.
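Something like this is the shape of it (a stub sketch, not the actual browser-use API; Agent and StepResult are faked here so the control flow is visible):

    # You own the loop: one model-proposed action per iteration, which you
    # can log, veto, or stop at any point.
    from dataclasses import dataclass

    @dataclass
    class StepResult:
        action: str
        done: bool

    class Agent:
        def __init__(self, task: str):
            self.task = task
            self._plan = ["open_page", "click_login", "done"]  # stand-in for LLM output

        def step(self) -> StepResult:
            action = self._plan.pop(0)
            return StepResult(action=action, done=action == "done")

    agent = Agent(task="log in to example.com")
    for i in range(20):              # the caller controls the step budget
        result = agent.step()
        print(i, result.action)      # every action is visible as it happens
        if result.done:
            break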

How would you imagine the perfect scenario? What would make LLM outputs less of a black box?


Nope: a 1280x1024 screenshot at low resolution with GPT-4o is 85 tokens, so approx $0.0002 (so 100x cheaper). For high resolution it's approx $0.002. https://openai.com/api/pricing/
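Quick sanity check on that math (the $2.50 per 1M input tokens price for GPT-4o is an assumption from the pricing page; 85 tokens is the flat cost of a low-detail image):

    # Back-of-the-envelope cost of one low-detail screenshot with GPT-4o.
    price_per_token = 2.50 / 1_000_000   # assumed GPT-4o input price (USD/token)
    low_detail_tokens = 85               # flat token cost of a low-detail image

    print(low_detail_tokens * price_per_token)  # ~0.0002 USD per screenshot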


Yeah. I noticed very low cost when I ran it via a VM with a predefined resolution. Good tip.


Could you elaborate on the CLI idea? I am intrigued but not exactly sure what you mean.


(handwaving) I'd rather be in a loop of "here's our goal. here's latest output from the CLI. what do we type into the CLI" than the GUI version of that loop.

I hope that's clearer; I'm a bit over-caffeinated.


Hmm, but isn't this how we handle it? We just have a CLI that outputs exactly that (goal and state) and asks the user for more clarity if needed; no GUI. The original idea was to make it completely headless.


I'm sorry, I'm definitely off today and am missing it; I appreciate your patience.

I'm thinking maybe the goal/state stuff might have clouded my point. Setting aside prompt engineering, I'm just thinking of the stock AI UIs today, i.e. chat-based.

Then, we want to accomplish some goal using GUI and/or CLI. Given the premise that I'd avoid GUI automation, why am I saying CLI is the way to go?

A toy example: let's say the user says "get my current IP".

If our agent is GUI-based, maybe it does: open Chrome > type in whatismyip.com > recognize IP from screenshot.

If our agent is CLI-based, maybe it does: run the curl command to fetch the user's IP from a public API (e.g. curl whatismyip.com) > parse the output to extract the IP address > return the IP address to the user as text.

In the CLI example, the agent interacts with the system using native commands (in this case, curl) and text outputs, rather than trying to simulate GUI actions and parse screenshot contents.
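A minimal sketch of that loop (assumptions: the OpenAI chat API, the model name, and the one-command-per-turn protocol are all illustrative, and obviously you'd only run generated commands in a sandbox):

    import subprocess
    from openai import OpenAI

    client = OpenAI()
    goal = "get my current IP"
    output = ""

    for _ in range(5):  # small step budget for the toy example
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Reply with exactly one shell command, or DONE:<answer>."},
                {"role": "user", "content": f"Goal: {goal}\nLast output:\n{output}"},
            ],
        ).choices[0].message.content.strip()
        if reply.startswith("DONE:"):
            print(reply[5:])  # goal achieved, return the answer as text
            break
        # Execute the command and feed its text output back into the loop.
        output = subprocess.run(reply, shell=True, capture_output=True, text=True).stdout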

Why do I believe that's preferable over GUI-based automation?

1. More direct/efficient - no need for browser launching, screenshot processing, etc.

2. More reliable - dealing with only structured text output, rather than trying to parse visual elements

3. Parallelizable: I can have N CLI shells, but only 1 GUI shell, which is shared with the user.

4. In practice, I'm basing that on observations of the GUI-automation project I mentioned, accepting that computer automation is desirable, and... work I did to build an end-to-end testing framework for devices paired to phones, both iOS and Android.

What the? Where did that come from?

TL;DR: I've loved E2E tests for years, and it was stultifying to see how little they were used beyond the testing team due to flakiness. Even small things like "launch the browser" are extremely fraught. How long do we wait? How often do we poll? How do we deal with some dialog appearing in front of the app? How do we deal with not having the textual view hierarchy for the entire OS?
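(Every answer to "how long do we wait, how often do we poll" that I've seen ends up as some variant of a helper like this; names and defaults are arbitrary:)

    import time

    def wait_until(condition, timeout=10.0, interval=0.25):
        """Poll `condition` until it returns truthy or `timeout` elapses."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError(f"condition not met within {timeout:.1f}s")

    # e.g. wait_until(lambda: window_is_visible(), timeout=30)  # hypothetical predicate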


Try Open Interpreter for CLI computer automation.


Nice, thanks


I don’t know a lot about this, but do you have the full power of Selenium or not? That would also be a very interesting approach, especially when “local” browser models get very good.


From 3 days of playing around with it, I couldn’t find a way to use Selenium or Playwright in the browser.

What I did instead was set up a loop to send instructions from Playwright.

For instance, I open the browser and then enter a loop that awaits instructions (which can come from an event source such as Redis) to execute in the same browser. But still, it’s based on the session instantiated by Playwright.
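A sketch of that pattern (the Redis channel name and message format here are made up for illustration):

    # Keep one Playwright session alive and execute instructions pushed
    # through Redis, all in the same browser.
    import json
    import redis
    from playwright.sync_api import sync_playwright

    r = redis.Redis()
    pubsub = r.pubsub()
    pubsub.subscribe("browser-instructions")  # hypothetical channel name

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        for message in pubsub.listen():       # block until the next instruction
            if message["type"] != "message":
                continue
            instruction = json.loads(message["data"])
            if instruction["action"] == "goto":
                page.goto(instruction["url"])
            elif instruction["action"] == "click":
                page.click(instruction["selector"])
            elif instruction["action"] == "quit":
                break

        browser.close()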


Thanks man, starred yours too, it's super cool to see all these projects getting spun up!

I see Cerebellum is vision-only. Did you try adding HTML + screenshots? I think that improves the performance like crazy, and you don't have to use only Claude.

Just saw Skyvern today on previous Show HNs haha :)


I had an older version that used simplified HTML, and it got to decent performance with GPT-4o and Gemini, but at the cost of 10x token usage. You are right: identifying the interactable elements and pulling their values into a prompt structure that explicitly allows the next actions can boost performance, especially if done with grammar-constrained decoding like structured outputs or guidance-llm. However, I saw that Claude had similar performance with pure vision, and I felt that vision + more training would beat a specialized DOM algorithm due to "the bitter lesson".
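For reference, the structured-outputs flavor of that looks roughly like this (a sketch; the action schema, element list, and model name are illustrative):

    # Constrain the model's next action to a fixed schema over the
    # interactable elements extracted from the page.
    from enum import Enum
    from typing import Optional
    from openai import OpenAI
    from pydantic import BaseModel

    class ActionType(str, Enum):
        click = "click"
        type_text = "type_text"

    class NextAction(BaseModel):
        action: ActionType
        element_index: int          # index into the extracted interactable elements
        text: Optional[str] = None  # only used for type_text

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Pick the next browser action."},
            {"role": "user", "content": "Elements: [0] <input name=q> [1] <button>Search</button>"},
        ],
        response_format=NextAction,
    )
    print(completion.choices[0].message.parsed)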

BTW I really like your handling of browser tabs, I think it's really clever.


Fair. Also, Claude probably only gets better at this, since they kinda want people to use Computer Use. We are gonna try to do the best of both worlds.

Thanks man, Magnus came up with it this morning haha!


I starred both of you


Looks nice. I find the HTML-cleaning step in our pipeline extremely important; otherwise there is no real benefit over just using a general vision model and clicking coordinates (and the whole HTML is just way too many tokens). How do you guys handle that?
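Rough sketch of the kind of cleaning step I mean (BeautifulSoup is just one way to do it; the tags kept and dropped are illustrative):

    from bs4 import BeautifulSoup

    KEEP = ["a", "button", "input", "select", "textarea", "label"]

    def clean_html(raw: str) -> str:
        soup = BeautifulSoup(raw, "html.parser")
        for tag in soup(["script", "style", "svg", "noscript"]):
            tag.decompose()                       # drop non-semantic bulk
        lines = []
        for el in soup.find_all(KEEP):
            text = el.get_text(" ", strip=True)[:80]
            lines.append(f"<{el.name}> {text}")   # compact, token-cheap summary
        return "\n".join(lines)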

