Thanks! I think the right solution is 100% a mixture. Currently thinking it should be mostly deterministic with some intentional/limited usage of runtime AI. And then AI debugging tooling on top of that
For our scripts running in prod we handle this in 2 ways:
- We use runtime agents in very specific places. For example, on Availity they frequently have popups right after you log in, so if there's a failure right after login we spin up an agent to close it and then resume the flow (basically a try/catch)
- We wait for it to fail and then tell the agent to look at the error logs and use `libretto run` command to rerun the workflow and fix the error
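The popup-recovery pattern from the first bullet can be sketched roughly like this (function names are hypothetical, not Libretto's actual API):

```python
def run_with_popup_recovery(action, spin_up_agent):
    """Try a deterministic page action; on failure, let a runtime
    agent clear whatever popup is in the way, then retry once."""
    try:
        return action()
    except Exception:
        # Hypothetical agent call: ask it to close the popup, then resume
        spin_up_agent("close any popup blocking the page")
        return action()
```

The point is that the agent only runs on the failure path, so the happy path stays fully deterministic.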
We're thinking of extending libretto to handle these better though. Some of our ideas:
- Adding global/custom fallback steps to every page action. This way we could, for example, add our popup-handler error recovery to all page actions or some subset of them
- Having a hosted version which flags errors and has a coding agent examine the issue and open a PR with the fix
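The global-fallback idea could look something like a registry of recovery handlers that gets consulted whenever any page action fails (a sketch of one possible design, not a committed API):

```python
class ActionRunner:
    def __init__(self):
        self.fallbacks = []

    def register_fallback(self, handler):
        # handler(error) -> True if it recovered the page, False otherwise
        self.fallbacks.append(handler)

    def run(self, action):
        try:
            return action()
        except Exception as err:
            for handler in self.fallbacks:
                if handler(err):
                    return action()  # retry once after recovery
            raise  # nothing recovered; surface the original error
```

Registering the popup handler once would then cover every page action instead of wrapping each one by hand.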
Answered similar question above responding to someone's MCP sampling comment. Spinning up a separate agent in the CLI was our initial approach but we switched to snapshot via API because of speed and reliability.
We can update the config though to allow you to set up snapshot through the CLI instead of going through the API!
Not totally sure I understand, but if you're talking about the snapshot command (which requires an API key): we initially had it spinning up a tmux session to analyze the snapshot instead of using the API. We switched it to use the API for 2 reasons:
1. Noticed that the API was a couple seconds faster than spinning up the coding agent
2. With a separate agent you can't guarantee its behavior, and we wanted to enforce that only a single LLM call runs to read the snapshot and analyze the selector. You can guarantee this with an API call but not with a local coding agent
Sorry, yeah, it was a bit vague. I was thinking about creating a Libretto MCP since it's a/the standard way to share AI tooling nowadays, and that would make it usable in more contexts.
In that case, the protocol has a feature called "sampling" that allows the MCP server (Libretto) to send completion requests to the MCP client (the main agent/harness the user interacts with). That means Libretto would not need its own LLM API keys to work; it would piggyback on the LLMs configured in the main harness (sampling also supports "picking" the style of model you prefer, smart vs fast etc).
This is a great flag and something we want to spend more time experimenting with as we continue to build out the repo.
Right now we kind of have a mixture of the 2 approaches, but there's a lot of room for improvement.
- When libretto performs the code generation it initially inspects the page and tests the network calls/playwright actions individually using the `snapshot` and `exec` tools. After it's tested all of the individual selectors and thinks it's finished, it creates a script and then runs the script from scratch. Oftentimes the generated script will fail, which triggers libretto to identify the failure, update the code, and repeat this process until the script works. That iteration process helps make the scripts much more reliable.
- The way our `snapshot` command works is that we send a screenshot + DOM (depending on size, it may be condensed) to a separate LLM and ask it to figure out the relevant selectors. We do this to avoid polluting the main agent's context with the DOM + lots of screenshots. As part of that analyzer's prompt we tell it to prefer selectors using: data-testid, data-test, aria-label, name, id, role. This just lives in the analyzer prompt and is not deterministic though. It'd be interesting to see if we can improve script quality by adding a hard constraint on the selectors or with different prompting.
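One way the hard constraint could work is a deterministic ranking pass over an element's attributes, applied after (or instead of) the analyzer LLM's choice. A rough sketch, with the preference order taken from the prompt above:

```python
# Preference order from the analyzer prompt, most robust first
PREFERRED_ATTRS = ["data-testid", "data-test", "aria-label", "name", "id", "role"]

def pick_selector(attrs):
    """Given an element's attribute dict, build a CSS selector from the
    most robust attribute available; return None if nothing matches."""
    for attr in PREFERRED_ATTRS:
        if attr in attrs:
            if attr == "id":
                return f"#{attrs['id']}"
            return f'[{attr}="{attrs[attr]}"]'
    return None  # would fall back to the LLM's suggestion here
```

This would guarantee the preference order instead of hoping the prompt is followed, while still letting the LLM handle elements with none of the preferred attributes.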
I'm also curious if you have any guidance for prompt improvements we can give the snapshot analyzer LLM to help it pick more robust selectors right off the bat.
Agreed! One thing that we felt was missing from the existing MCP tools was user recording. For old and shitty healthcare websites it's easier to just show the workflow than explain it
The playwright codegen tool exists, but the script it generates is super simple and it can't handle loops or data extraction.
So for libretto we often use a mix of instructions + recording my actions for the agent. Makes the process faster than just relying on a description and waiting for the agent to figure out the whole flow
There are a couple ways to handle JS components rendered at runtime:
- Libretto prefers network requests over DOM interaction when possible, so this will circumvent a lot of complex JS rendering issues
- When you do need the DOM, playwright can handle a lot of the complexity out of the box: playwright will re-query the live DOM at action time and automatically wait for elements to populate. Libretto is also set up to pick selectors like data-testid, aria-label, role, id over class names or positional stuff that's likely to be dynamic.
- At the end of the day the files still live as code so you could always just throw a browser agent at it to handle a part of a workflow if nothing else works
Super cool! Please let me know how it goes. Since agents are so good at writing code, we think letting the agent rewrite/test the code on failure is better than just using a prompt at runtime
Right now libretto only captures HTTP requests, which the coding agent can use to determine how to perform the automation.
For more complex cases where libretto can't validate that the network approach would produce the right data (like sites that rely on WebSockets or heavy client-side logic) it falls back to using the DOM with playwright
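That network-first/DOM-fallback decision can be sketched as follows (all names hypothetical; `validate` stands in for whatever check confirms the network data matches what the workflow needs):

```python
def extract(network_fetch, dom_scrape, validate):
    """Prefer the captured HTTP endpoint; fall back to DOM scraping
    via Playwright when the network data can't be validated."""
    try:
        data = network_fetch()
        if validate(data):
            return data, "network"
    except Exception:
        pass  # e.g. WebSocket-only site, no usable HTTP endpoint
    return dom_scrape(), "dom"
```

The network path skips JS rendering entirely, which is why it's preferred; the DOM path is the escape hatch for heavy client-side logic.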