SAM is used to segment GUI elements, which are then sent to GPT-4V to be described. We then prompt GPT-4 with the user's actions and the descriptions of the GUI elements to generate new actions that accomplish different behaviors.
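For anyone curious what that pipeline looks like in practice, here's a minimal sketch using the public segment-anything and openai packages. This is not OpenAdapt's actual code; the checkpoint path, model names, and prompts are placeholders.

```python
# Sketch of the pipeline described above: SAM segments the screenshot,
# a vision model describes each segment, and a text model proposes actions.
import base64
import io

import numpy as np
from PIL import Image
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

client = OpenAI()


def segment_gui(image: np.ndarray) -> list[dict]:
    """Run SAM's automatic mask generator over a screenshot (HWC, RGB, uint8)."""
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
    return SamAutomaticMaskGenerator(sam).generate(image)


def describe_element(crop: Image.Image) -> str:
    """Ask a vision model to describe one segmented GUI element."""
    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this GUI element in one sentence."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content


def propose_actions(user_actions: list[str], descriptions: list[str]) -> str:
    """Prompt a text model with recorded actions plus element descriptions."""
    prompt = (
        "Recorded user actions:\n" + "\n".join(user_actions)
        + "\n\nGUI elements on screen:\n" + "\n".join(descriptions)
        + "\n\nGenerate the next actions to accomplish the task."
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


screenshot = np.array(Image.open("screenshot.png").convert("RGB"))
descriptions = []
for mask in segment_gui(screenshot)[:10]:  # cap the number of elements for cost
    x, y, w, h = (int(v) for v in mask["bbox"])  # SAM bounding boxes are XYWH
    descriptions.append(describe_element(Image.fromarray(screenshot[y:y + h, x:x + w])))
print(propose_actions(["click 'File'", "click 'Save As'"], descriptions))
```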
I find it quite ironic that Google is one of the biggest players in creating solutions that actively contribute to defeating its very own anti-automation software.
Makes you wonder if the goal of their captcha system was ever really to stop people from botting.
I can't wait for AI to become the ultimate ad-removal tool.
There might be an arms race, but the anti-ad side will win as long as there isn't a unilateral winner (strongest models, biggest platform).
There will be enough of a shake-up to the current regime -- search, browsers, etc. -- that there is opportunity for new players to attack on multiple fronts. Given a choice, I don't think users will accept a hamstrung interface that forces a subpar experience.
We basically just need to make sure Google, Microsoft/OpenAI, or some other industry giant doesn't win or that we don't wind up living under a cabal of just a few players.
I'm already hopefully imagining AI agents working for us to not just remove advertising noise, but to actively route around all of the times and places we're taken advantage of. That would be an excellent future.
I'm now envisioning an ad framework where, instead of selling rectangles of content, advertisers bid to use an LLM to rewrite the whole article with at least {{4}} mentions of {{how refreshing pepsi is}}.
Apple won't allow AI-powered extensions. They currently don't even allow blacklist-powered extensions like uBlock Origin.
And that's the "privacy"-focused company, shoving ads down your throat.
The competitor is Google.
And 90% of users spend 90% of their time in walled gardens like Instagram or TikTok anyway. They see built-in ads.
I think either I'm crossing paths with you a lot and am always struck by so much trusting enthusiasm... or there are many people who share the dream you're describing, in which case one of you should build this product :-)
That would be the 4D chess move: imagine getting a captcha that says "click the one thing that does not fly", but you're actually helping select drone targets somewhere in the Middle East.
This is what happens at the intersection of unlimited VC money for "AI" and wannabe entrepreneurs who read Ender's Game and thought "business opportunity here".
Isn't creating a dataset for this the easiest part? We have the HTML source text and how it is rendered, with all the intermediate info about tags, CSS layout, etc. available from most modern browsers.
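Something along these lines could harvest screenshot-plus-layout pairs from any page. This is a rough Playwright sketch; the selector list and output format are just my guesses at what a useful training record would contain, not anything from the ScreenAI paper.

```python
# Capture a full-page screenshot plus per-element tag/text/bounding-box records.
import json

from playwright.sync_api import sync_playwright


def harvest(url: str, out_prefix: str = "sample") -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png")

        records = []
        for el in page.query_selector_all("a, button, input, img, h1, h2, p"):
            box = el.bounding_box()  # None if the element isn't rendered
            if box is None:
                continue
            records.append({
                "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                "text": (el.inner_text() or "")[:200],
                "bbox": [box["x"], box["y"], box["width"], box["height"]],
            })
        browser.close()

    with open(f"{out_prefix}.json", "w") as f:
        json.dump(records, f, indent=2)


harvest("https://example.com")
```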
I can't wait for the new wave of terrible UIs specifically designed to fool AI agents into clicking the "send me all your money" button. (For bonus points, do this while making the UI seem perfectly reasonable to humans.)
I haven't read through it yet, but there's Ferret-UI from Apple (mobile-specific, but I think a lot of the learnings are generic): https://arxiv.org/abs/2404.05719
Imagine a top-level screen filter that processes everything you see in ways you define. For example, "hide all faces" could help you spot new details in a movie, since your eyes won't be automatically drawn to faces. Or "hide all proper names" could make internet browsing more interesting and mysterious.
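The "hide all faces" part is already roughly doable with classical computer vision, no LLM needed. A crude single-frame sketch (file names are placeholders, and a real screen filter would have to run this on every frame):

```python
# Detect faces in a captured frame with a Haar cascade and blur them out.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("movie_frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w], (51, 51), 0)
cv2.imwrite("movie_frame_faces_hidden.png", frame)
```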
Imagine a new crop of QA automation tooling that's going to leverage these capabilities!
• Semantic change comparison between screenshots: visual regression testing where you prompt the model to ignore certain things instead of masking them, and where it labels each change with a message like "chart color changed" or "text shifted down by 3 pixels" (see the sketch after this list).
• Using plain English as test scenarios, instead of brittle WebDriver-like APIs.
• Autonomous agent fuzzing the application by free roaming the UI.
• RAG the design artifacts from Jira and Google Docs for more targeted feature exploration and test scenario generation.
• Automatic bug reports as the output of the above. Or even send a draft PR to fix an issue, while at it!
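As a toy sketch of the first idea above (semantic screenshot comparison), something like this could work; the model name and prompt wording are my assumptions, not an existing tool.

```python
# Send "before" and "after" screenshots to a vision model and ask for labeled
# differences, ignoring whatever the prompt lists.
import base64

from openai import OpenAI

client = OpenAI()


def _data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def semantic_diff(before: str, after: str, ignore: str = "timestamps, ad banners") -> str:
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": (
                "Compare these two UI screenshots (before, then after). "
                f"Ignore: {ignore}. List each meaningful change on its own line, "
                "e.g. 'chart color changed' or 'text shifted down'.")},
            {"type": "image_url", "image_url": {"url": _data_url(before)}},
            {"type": "image_url", "image_url": {"url": _data_url(after)}},
        ]}],
    )
    return resp.choices[0].message.content


print(semantic_diff("checkout_before.png", "checkout_after.png"))
```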
The more I think about use cases, the more it sounds like full software development automation. Late in the game, we probably won't need software as it exists today at all. This feels like reading Accelerando again, but this time it's happening for real, and to you.
P.S.: Didn't expect to see a cafe from Cyprus - Akakiko Limassol - used as the demo, I'll remember to visit next time I'm in the area :P
As we were recently discussing how blind people use computers, navigate the web, and write programs in code editors: ScreenAI and other ways of giving LLMs a visual mode are promising, giving people the ability to understand and interact with visual interfaces using natural language.
The core aspects of this research, datasets, and use cases discussed here have been in progress for quite a long time at Google (it's been WIP for many many years). The same can probably be said of Apple's paper though!
We haven't been able to use Claude 3 Opus vision yet because we're in Canada, but GPT-4V works extremely well (when combined with Segment Anything). See: https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in progress).
Unfortunately we can't compare it to ScreenAI directly since, as far as I can tell, it is not generally available. However, ScreenAI does not appear to use a separate segmentation step, which we needed to implement in order to get good results.
Update: it looks like Anthropic now accepts Canadian credit cards.
The results are not as good as GPT-4V or Gemini. I've posted the output for each of `gpt-4-vision-preview`, `gpt-4-turbo-2024-04-09`, `gemini-1.5-pro-latest`, and `claude-3-opus-20240229`. Claude is the only one that makes mistakes, at least on that test.
Regarding failure modes, we have yet to do extensive testing, but I have seen it confuse the divide and subtract buttons on the calculator, though only once.
"We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability."
Looks useful to me for replicating some things. Good stuff!
I think the interpretability of the model and the protection of user privacy also need attention, especially in scenarios involving personal data. So while this research is promising, further work is still needed before it can be applied in practice.
Work-in-progress: https://github.com/OpenAdaptAI/OpenAdapt/pull/610