> Could probably write a book about the lessons learned getting a "vision first" approach to automation working
Ha, that would be splendid! Please do, maybe even a blog on valetnet.dev (lovely site, btw; a demo or video would be nice).
I'm convinced vision first is the way to go. People say it's slow, but the benefits are tremendous: a lot of websites simply do not play nice with HTML, and I do not like having to inspect XHR to figure out APIs.
SikuliX was my last love affair with this approach, but eventually I lost interest in scraping and automation, so I'm pleased to see people still working on vision-first automation approaches.
Agreed on the need for a demo. #1 on the TODO list! If I know at least one person will read it, I might even do a blog, too! :)
The rise of multi-modal LLMs is making "vision first" plausible. However, my basic test is asking these models to find the X,Y screen coordinates of the number "1" on a screenshot of a calculator app. ChatGPT-4o still can't do it. Same with LLaVA 1.5 last I tried. But I'm sure it'll get there someday soon.
Yeah, SikuliX was dependent on old school "classic" OpenCV methods. No machine learning involved. To some extent those methods still work in highly constrained domains like UI automation... But I'm looking forward to sprinkling in some AI magic when it's ready.
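For anyone curious what the "classic", no-ML approach looks like, here is a minimal sketch of template matching in plain Python: slide a small reference image over the screenshot and keep the position with the lowest sum of squared differences. This is not SikuliX's actual code; `find_template` and the toy pixel grids are made up for illustration, and a real tool would run something like OpenCV's `matchTemplate` on real screenshots instead of nested lists.

```python
def find_template(screen, template):
    """Return ((x, y), score) for the best match of template inside screen.

    Both arguments are 2D lists of grayscale pixel values. (x, y) is the
    top-left corner of the best match; score is the sum of squared
    differences, so lower is better and 0 means a pixel-perfect match.
    """
    sh, sw = len(screen), len(screen[0])
    th, tw = len(template), len(template[0])
    best_score, best_xy = None, None
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            score = sum(
                (screen[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(th)
                for dx in range(tw)
            )
            if best_score is None or score < best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score


# Hypothetical toy data: a 6x5 "screenshot" with a 2x2 "button" at x=3, y=1.
screen = [[0] * 6 for _ in range(5)]
button = [[9, 7], [7, 9]]
for dy in range(2):
    for dx in range(2):
        screen[1 + dy][3 + dx] = button[dy][dx]

(x, y), score = find_template(screen, button)
# Click target: the middle of the matched region.
center = (x + len(button[0]) // 2, y + len(button) // 2)
print((x, y), score, center)
```

The catch is the same one that makes this approach brittle outside constrained domains: the match is pixel-based, so a theme change, scaling, or anti-aliasing difference raises the score and can move the best match somewhere wrong.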