1) Using the LLM to find elements/selectors in HTML
2) Using LLMs to fill out logical/likely/meaningful answers to things
I highly recommend you decouple these 2 efforts. While you gave a good example of "insurance quote step by step webapp", the vast majority of web scraping efforts are much more mundane.
Additionally, even in this instance, the selector brain/intelligence brain don't need to be coupled.
For example:
Selector brain: "Find/click the button for foreign drivers license."
Selector brain: "Find the country of origin field."
Selector brain: "Find the expiry date field."
LLM-intelligence brain: "Use values from prompt to fill out the country of origin and expiry date fields."
Not-LLM intelligence brain: Inputs values from a JSON object of documentSelector=>value.
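As a rough sketch of that split (Playwright's Page type is real, but ask_selector_llm and the selector names are illustrative assumptions, not an existing API):

```python
from playwright.sync_api import Page

def resolve_selectors(page_html: str, ask_selector_llm) -> dict:
    """Selector brain: one narrow LLM call per element we need to locate."""
    return {
        "foreign_license_button": ask_selector_llm(page_html, "button for foreign drivers license"),
        "country_of_origin": ask_selector_llm(page_html, "country of origin field"),
        "expiry_date": ask_selector_llm(page_html, "expiry date field"),
    }

def fill_fields(page: Page, selectors: dict, values: dict) -> None:
    """Not-LLM intelligence brain: plain documentSelector => value input."""
    page.click(selectors["foreign_license_button"])
    for field, selector in selectors.items():
        if field in values:
            page.fill(selector, values[field])

# At fill time the values can come straight from a JSON object, no LLM involved:
# fill_fields(page, selectors, {"country_of_origin": "Canada", "expiry_date": "2027-06-30"})
```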
Interesting. We've decoupled navigation and extraction specifically for this reason, but I suppose decoupling selection from input could let us use cheaper, smaller LLMs to "select" and answer.
We've been approaching it a little differently. We think larger, more capable models would actually immediately improve the performance of Skyvern. For example, if you run it with LLaVA, performance degrades significantly, likely because of the coupling.
But since we use GPT-4V, and it's rumoured to be a MoE model, I wonder if there's implicit decoupling going on.
I'm gonna spend some more time thinking about this
I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.
I don't want to use vision and LLMs for every page. I just want to use vision and LLMs once, to figure out what elements need to be clicked. Or maybe every time the site changes the frontend.
The AI would be a compiler that generates the traditional scraper / integration test.
It would save all the time spent manually going through every page and figuring out which mistake we made when that input string doesn't go into that input field or the button on the modal window is not clicked.
I didn’t check the code but there would be a few good ways to specify what you want:
* browser extension that lets you record a few actions
* describing what you want to do with text
* a url with one or two lines of desired JSON to extract
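Roughly what I'd expect the generated output to look like: a plain Playwright script with the selectors baked in, regenerated only when the frontend changes (the URL and selectors below are placeholders, not from any real site):

```python
from playwright.sync_api import sync_playwright

def run_quote_flow(values: dict) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/quote")           # placeholder URL
        page.click("#foreign-license")                   # selectors the LLM found once
        page.fill("#country-of-origin", values["country_of_origin"])
        page.fill("#expiry-date", values["expiry_date"])
        page.click("button[type=submit]")
        browser.close()
```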
No, that's something completely different from what bravura is talking about, which is why he explicitly commented that he still thinks you're missing the point.
From your roadmap:
> Prompt Caching - Introduce a caching layer to the LLM calls to dramatically reduce the cost of running Skyvern (memorize past actions and repeat them!)
Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes; at that point you use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.
We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization, instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: The generated output would just be a list of actions and their associated data source.
For example, if Skyvern was asked to log-in to a website and do a search for product X, the generated action plan would include:
1. Click the log in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search
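Concretely, the memorized plan is data rather than code; something like this (a simplified sketch; the field names and source references are illustrative, not the actual schema):

```python
action_plan = [
    {"action": "click", "target": "log in button"},
    {"action": "click", "target": "sign in with email"},
    {"action": "input", "target": "email field",    "source": "X"},
    {"action": "input", "target": "password field", "source": "Y"},
    {"action": "click", "target": "log in button"},
    {"action": "click", "target": "search bar"},
    {"action": "input", "target": "search bar",     "source": "Z"},
    {"action": "click", "target": "search button"},
]
```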
Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:
1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
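Option 2 is essentially a replay loop that only falls back to the LLM for the step that broke (a minimal sketch; execute() and regenerate_selector() are hypothetical helpers, not Skyvern's API):

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeout

def replay(page, action_plan, execute, regenerate_selector):
    for action in action_plan:
        try:
            execute(page, action)  # replay with the memorized selector
        except PlaywrightTimeout:
            # The layout changed for this step: regenerate just its selector
            # and assume the rest of the plan still works.
            action["selector"] = regenerate_selector(page, action["target"])
            execute(page, action)
```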