1) Using the LLM to find elements/selectors in HTML
2) Using LLMs to fill out logical/likely/meaningful answers to things
I highly recommend you decouple these 2 efforts. While you gave a good example of "insurance quote step by step webapp", the vast majority of web scraping efforts are much more mundane.
Additionally, even in this instance, the selector brain/intelligence brain don't need to be coupled.
For example:
Selector brain: "Find/click the button for foreign drivers license."
Selector brain: "Find the country of origin field."
Selector brain: "Find the expiry date field."
LLM-intelligence brain: "Use values from prompt to fill out the country of origin and expiry date fields."
Not-LLM intelligence brain: Inputs values from a JSON object of documentSelector=>value.
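As a rough sketch of that split (Playwright's Page type is real, but ask_selector_llm and the selector names are illustrative assumptions, not an existing API):

```python
from playwright.sync_api import Page

def resolve_selectors(page_html: str, ask_selector_llm) -> dict:
    """Selector brain: one narrow LLM call per element we need to locate."""
    return {
        "foreign_license_button": ask_selector_llm(page_html, "button for foreign drivers license"),
        "country_of_origin": ask_selector_llm(page_html, "country of origin field"),
        "expiry_date": ask_selector_llm(page_html, "expiry date field"),
    }

def fill_fields(page: Page, selectors: dict, values: dict) -> None:
    """Not-LLM intelligence brain: plain documentSelector => value input."""
    page.click(selectors["foreign_license_button"])
    for field, selector in selectors.items():
        if field in values:
            page.fill(selector, values[field])

# At fill time the values can come straight from a JSON object, no LLM involved:
# fill_fields(page, selectors, {"country_of_origin": "Canada", "expiry_date": "2027-06-30"})
```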
Interesting. We've decoupled navigation and extraction specifically for this reason, but I suppose decoupling selection from input could let us use cheaper, smaller LLMs to "select" and answer.
We've been approaching it a little differently. We think larger, more capable models would actually immediately improve the performance of Skyvern. For example, if you run it with LLaVA, performance degrades significantly, likely because of the coupling.
But since we use GPT-4V, and it's rumoured to be a MoE model, I wonder if there's implicit decoupling going on.
I'm gonna spend some more time thinking about this
I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.
I don't want to use vision and LLMs for every page. I just want to use vision and LLMs once, to figure out what elements need to be clicked. Or maybe every time the site changes the frontend.
The AI would be a compiler that generates the traditional scraper / integration test.
It would save all the time spent manually going through every page and figuring out which mistake we made when that input string doesn't go into that input field or the button on the modal window is not clicked.
I didn’t check the code but there would be a few good ways to specify what you want:
* browser extension that lets you record a few actions
* describing what you want to do with text
* a url with one or two lines of desired JSON to extract
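Roughly what I'd expect the generated output to look like: a plain Playwright script with the selectors baked in, regenerated only when the frontend changes (the URL and selectors below are placeholders, not from any real site):

```python
from playwright.sync_api import sync_playwright

def run_quote_flow(values: dict) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/quote")           # placeholder URL
        page.click("#foreign-license")                   # selectors the LLM found once
        page.fill("#country-of-origin", values["country_of_origin"])
        page.fill("#expiry-date", values["expiry_date"])
        page.click("button[type=submit]")
        browser.close()
```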
No, that's something completely different from what bravura is talking about, which is why he explicitly commented that he still thinks you're missing the point.
From your roadmap:
> Prompt Caching - Introduce a caching layer to the LLM calls to dramatically reduce the cost of running Skyvern (memorize past actions and repeat them!)
Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes; at that point you use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.
We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization, instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: The generated output would just be a list of actions and their associated data source.
For example, if Skyvern was asked to log-in to a website and do a search for product X, the generated action plan would include:
1. Click the log in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search
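Concretely, the memorized plan is data rather than code; something like this (a simplified sketch; the field names and source references are illustrative, not the actual schema):

```python
action_plan = [
    {"action": "click", "target": "log in button"},
    {"action": "click", "target": "sign in with email"},
    {"action": "input", "target": "email field",    "source": "X"},
    {"action": "input", "target": "password field", "source": "Y"},
    {"action": "click", "target": "log in button"},
    {"action": "click", "target": "search bar"},
    {"action": "input", "target": "search bar",     "source": "Z"},
    {"action": "click", "target": "search button"},
]
```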
Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:
1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
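Option 2 is essentially a replay loop that only falls back to the LLM for the step that broke (a minimal sketch; execute() and regenerate_selector() are hypothetical helpers, not Skyvern's API):

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeout

def replay(page, action_plan, execute, regenerate_selector):
    for action in action_plan:
        try:
            execute(page, action)  # replay with the memorized selector
        except PlaywrightTimeout:
            # The layout changed for this step: regenerate just its selector
            # and assume the rest of the plan still works.
            action["selector"] = regenerate_selector(page, action["target"])
            execute(page, action)
```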