Looks nice. I find the HTML cleaning step in our pipeline extremely important; without it there is no real benefit over just using a general vision model and clicking coordinates (and the whole HTML is just way too many tokens). How do you guys handle that?
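
To make the question concrete, here is a rough sketch of the kind of cleaning I mean, using only Python's standard `html.parser`. It is not our actual pipeline: the `SKIP_TAGS` and `KEEP_ATTRS` sets, the `HtmlCleaner` class, and the `clean_html` function are all names and choices made up for illustration. The idea is just to drop scripts, styles, and noisy attributes so the model sees far fewer tokens while the elements it can act on survive.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative assumptions: which subtrees to drop and which attributes to keep.
SKIP_TAGS = {"script", "style", "noscript", "svg", "head"}
KEEP_ATTRS = {"id", "href", "name", "type", "value", "placeholder", "aria-label", "role"}
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class HtmlCleaner(HTMLParser):
    """Re-emits a page with skipped subtrees and noisy attributes removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {key}="{escape(value, quote=True)}"'
            for key, value in attrs
            if key in KEEP_ATTRS and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        # Void elements have no closing tag; skipped subtrees emit nothing.
        if self.skip_depth or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.out.append(text)


def clean_html(markup: str) -> str:
    parser = HtmlCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

In practice you would probably also want to drop hidden elements and truncate long text nodes, but even this much removes most of the token overhead on typical pages. Note that it collapses whitespace aggressively, so spacing around inline tags is not preserved.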
Which one? The article has four examples, none of which are particularly "cool" or impressive.
If anything, the examples involving moving the mouse to the address bar or getting CSVs of results are very poor examples, because we can already do that much better without "computer use".
Compatible with any LLM and agentic framework