More

suchintan · 2024-10-25T19:43:54 1729885434

Yep. This is totally fair feedback -- we're still a super early product and haven't had a chance to optimize the phone experience.. largely because it's tough to see the magic from the phone

We'll improve it soon!

suchintan · 2024-10-25T16:48:24 1729874904

Yes -- in theory. You'd need to use our workflows feature to get that set up and chain a few tasks together to collect that information!

suchintan · 2024-10-25T02:13:51 1729822431

Great to see you on here Anton!

Just curious, how do you differentiate from other open source competitors like n8n and Active pieces?

pneumaticteam · 2024-10-25T04:52:07 1729831927

Hey Suchintan!

You are my open-source Godfather ;) Our discussion was one of the final pushes in my long-debated decision to go open source. And your awesome launch of Skyvern on HN today inspired me to make a similar announcement. Thank you for your inspiring work!

I see n8n as an open-source, self-hosted alternative to Zapier, essentially orchestrating how systems communicate with each other. Pneumatic, on the other hand, focuses on human input and helps people navigate complex processes. In fact, many of our existing users use Pneumatic together with Zapier and n8n (mostly in Europe).

ActivePieces seems like a cool product I hadn’t seen before. I’ll need some time to play with it before I can form an opinion.

suchintan · 2024-10-25T00:57:32 1729817852

You're making some really good points here

1/ the current prompt + payload structure is definitely on the complicated end of the spectrum, but we've found that we can use an LLM to help generate this payload for our users

The technical users want to learn more and generate their own payloads, and the non technical users prompt LLMs to help them generate the ultimate skyvern prompt to get going

This was very unexpected -- but a surprisingly logical chain of events.

Phase 1: build the thing the complex way (playwright) Phase 2: build the playwright thing with complex prompts (we are here right now) Phase 3: build the thing that builds the playwright thing with simpler prompts

Each phase lowers the technical bar to build your automations

2/ re: frequency of website changes

This IMO is a smaller value prop of LLM based automations. The biggest one is being able to handle highly dynamic situations. Consider the case where you're automating an e-commerce website where the popup offer changes every week. skyvern doesn't even notice those, but playwright scripts would break

Similarly, I love using the Geico example because it highlights something that was very difficult to automate before: The form changes every time you run it

Skyvern breezes through it.. but another case that was hard to automate before.

3/ data correctness

We're actually rolling out a workflows feature that allows you to chain multiple tasks together. The cool thing about this feature is that you can add steps in to have Skyvern self-validate it's own unless before continuing.

For example, you can add n products to cart, then navigate to the cart and validate the cart state

... As you can guess, this creates the foundation to have another agent go and use these tools to self-build workflows with simpler prompts

TL;DR -- we're on a pretty long journey to use LLMs to make BPA easier and easier, and this is just the first step

suchintan · 2024-10-24T22:00:38 1729807238

1. Yes absolutely. But the issue is a little bit more nuanced than that. Websites without APIs don't have them for one of two reasons: (1) They want to protect their data (LinkedIn) or (2) can't be bothered to make an API (boutique websites, government portals). This solves that problem, but also makes it so these websites never have to build an API (after LLM costs go down).

2. We don't want Skyvern to be used on websites that prohibit this kind of behaviour (LinkedIn is the obvious example). Specifically, we didn't open source any of our anti-bot or captcha related code because we get requests to make "Reddit upvote rings" and such. We don't want to support bad actors like that

(3) I think this is a net net good thing. AI browser automations= less need for APIs = no need to maintain both an API and UI = streamlined experience + less code = simpler systems

(4) I'm not 100% sure about this one. We usually just assume companies don't build APIs because they don't have budget for it. Ie for non malicious reasons. Companies like LinkedIn will likely thwart any attempts at automation, but we're not interested in participating in this cat mouse game

rmbyrro · 2024-10-25T17:03:44 1729875824

> after LLM costs go down

I think 100 Gb of GPU memory will always cost multiples of CPU + regular memory.

Using LLMs and computer vision for these kinds of tasks only make sense in small scales. If the task is extensive and repeated frequently, you're better off using an LLM to generate a script using Selenium or whatever, then running that script almost for free (compared to LLM). O1 is very good at it, by the way. For the $0.10 of 1 page interaction charged by Skyvern, I can create several scripts using O1.

suchintan · 2024-10-24T20:28:57 1729801737

Depends on the scope of the changes. What did you have in mind?

rokhayakebe · 2024-10-24T20:57:44 1729803464

Maybe add a new page or update a link.

biosboiii · 2024-10-25T10:55:27 1729853727

you can use the official API for that, right? without having to pay ChatGPT and click pixels.

suchintan · 2024-10-24T20:00:30 1729800030

We have an open issue for this right now -- we would LOVE some contributions here. The biggest problem until Llama 3.2 came out was that most (good) open source llms were text-only, and Skyvern needs vision to perform well

This isn't true anymore -- we just need to build and launch support for it

socksy · 2024-10-24T23:28:37 1729812517

In theory to support ollama all you should need to do is be able to change the URL that would otherwise go to OpenAI, and select the model. The only gotcha is that the llama3.2 builds for ollama are currently text only — however they've just added support for arbitrary hugging face models so you're not limited by the officially supported models.

suchintan · 2024-10-24T19:21:12 1729797672

Give it a try! It's very capable of doing simple tasks like logging in and clicking around. You'll need to prompt assertions like "Complete if..." and "Terminate if..."

suchintan · 2024-10-24T19:12:19 1729797139

Agreed. In the short term (X months) I expect the HTML Distillation + giving text to LLMs to win out.. but the long term (Y years) screenshot only + pixels will definitely be the more "scalable" approach

One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page

For example, you don't need to click on a dropdown to understand the options behind it, or scroll down on a page to find a button to click.

Certainly, as LLMs get cheaper the extra LLM calls will matter less (similar to what we're seeing happen with Solar panels where cost of panel < cost of labour now, but was reversed the preceding decade)

suchintan · 2024-10-24T18:46:57 1729795617

Not yet! We haven't shared them publicly yet because our internal dataset is super biased. Keep your eyes peeled though! They'll be coming out in the next few weeks :)