ScreenAI: A visual LLM for UI and visually-situated language understanding (research.google)
262 points by gfortaine 8 months ago | 40 comments



At OpenAdapt we have had excellent results combining Segment Anything Model (SAM) with GPT-4 for screen understanding.

Work-in-progress: https://github.com/OpenAdaptAI/OpenAdapt/pull/610


Interesting. Is this relying on SAM to segment text to send to GPT? How does it perform compared to GPT-4V?


SAM is used to segment GUI elements, which are then sent to GPT-4V to be described. We then prompt GPT-4 with the user's actions and the descriptions of the GUI elements to generate new actions to accomplish different behaviors.
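
For the curious, the flow is roughly the sketch below. This is a simplified illustration, not OpenAdapt's actual code: the checkpoint filename, prompt wording, and the describe_element helper are made up, and the OpenAI call assumes the standard chat-completions vision API.

    # Simplified sketch of the SAM -> GPT-4V -> GPT-4 pipeline described above.
    # Helper names and prompts are illustrative, not OpenAdapt's implementation.
    import base64
    import io

    import numpy as np
    from PIL import Image
    from openai import OpenAI
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    client = OpenAI()  # assumes OPENAI_API_KEY is set
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)

    def describe_element(crop: Image.Image) -> str:
        """Ask GPT-4V to describe a single segmented GUI element."""
        buf = io.BytesIO()
        crop.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Describe this GUI element in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content

    screenshot = Image.open("screenshot.png").convert("RGB")
    masks = mask_generator.generate(np.array(screenshot))
    descriptions = []
    for m in masks:
        x, y, w, h = m["bbox"]  # SAM returns boxes in XYWH format
        descriptions.append(describe_element(screenshot.crop((x, y, x + w, y + h))))

    # Finally, prompt plain GPT-4 with the recorded user actions plus these
    # element descriptions to generate new actions for the target behavior.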


MIT license too (getting rare)! Mad respect, thank you.


Thank you!


I find it quite ironic that Google is one of the biggest players in creating solutions that actively contribute to defeating its very own anti-automation software.

Makes you wonder if the goal of their captcha system was ever really to stop people from botting.


> Makes you wonder if the goal of their captcha system was ever really to stop people from botting.

I don't think that was the main goal, but rather for them to get a massive labeling dataset for training their models on the cheap.


It turns out it's fine to have the snake eat its own tail if your real goal is just to keep the snake fed.


"Remove the ads from this page"

I can't wait for AI to become the ultimate ad-removal tool.

There might be an arms race, but the anti-ad side will win as long as there isn't a unilateral winner (strongest models, biggest platform).

There will be enough of a shake up to the current regime -- search, browsers, etc. -- that there is opportunity for new players to attack multiple fronts. Given choice, I don't think users will accept a hamstrung interface that forces a subpar experience.

We basically just need to make sure Google, Microsoft/OpenAI, or some other industry giant doesn't win or that we don't wind up living under a cabal of just a few players.

I'm already hopefully imagining AI agents working for us not just to remove advertising noise, but to actively route around all of the times and places where we're taken advantage of. That would be an excellent future.


I'm now envisioning an ad framework where, instead of selling rectangles of content, advertisers bid to use an LLM to rewrite the whole article with at least {{4}} mentions of {{how refreshing pepsi is}}.


Apple won't allow AI-powered extensions. They currently don't even allow blacklist-powered extensions like uBlock. And that's the "privacy"-focused company, shoving ads down your throat.

The competitor is Google.

And 90% of users spend 90% of their time in walled gardens like Instagram or TikTok anyway. They see built-in ads.

Do I need to say more?


From time to time I see this comment pop up.

I think either I'm crossing paths with you a lot and am always struck by so much trusting enthusiasm... or there are many people who have the same dream you're mentioning, in which case one of you should build this product :-)


“It looks like this entire article is an advertorial piece for a book. Would you still like to read it?”


And most people would say it’s broken and only returns false positives. Well, half right, but the positives aren’t false…


Everything behind a paywall then?


That would be the 4D chess move: imagine getting a captcha that says "click the one thing that does not fly" while you're actually helping select drone targets somewhere in the Middle East.


This is what happens at the intersection of unlimited VC money for "AI" and wannabe entrepreneurs who read Ender's Game and thought "business opportunity here".



Huh, exactly what I thought when I saw this.


Isn't creating a dataset for this the easiest part? We have the source HTML text and how it's rendered, with all the intermediate info about tags, CSS layout, etc. available from most modern browsers.
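
Something like the sketch below (using Playwright; the selector list and output format are just illustrative, not from the paper) would pair a rendered screenshot with per-element text and bounding boxes.

    # Rough sketch: pair a rendered screenshot with per-element layout info,
    # the kind of (screen, annotation) data this sort of model needs.
    # Selectors and output format are illustrative only.
    import json
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto("https://example.com")
        page.screenshot(path="screen.png")

        annotations = []
        for el in page.query_selector_all("a, button, input, img, h1, h2, p"):
            box = el.bounding_box()  # None for hidden elements
            if box:
                annotations.append({
                    "tag": el.evaluate("e => e.tagName"),
                    "text": (el.inner_text() or "")[:100],
                    "bbox": box,  # {x, y, width, height} in CSS pixels
                })

        with open("annotations.json", "w") as f:
            json.dump(annotations, f, indent=2)
        browser.close()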


I can't wait for the new wave of terrible UIs specifically designed to fool AI agents into clicking the "send me all your money" button. (For bonus points, do this while making the UI seem perfectly reasonable to humans.)


That would take UI dark patterns to a whole new level.


I was looking for something similar recently and found CogAgent [0], which looks quite interesting. Has anyone tried anything similar?

0. https://github.com/THUDM/CogVLM?tab=readme-ov-file#gui-agent...


I haven't read through it yet, but there's FerretUI from Apple (mobile-specific, but I think a lot of learnings are generic) https://arxiv.org/abs/2404.05719


Imagine a top-level screen filter that processes all you see in ways which you define. For example, "hide all faces" can help you spot new details in a movie since your eyes won't be automatically attracted to faces. Or "hide all proper names" can make internet browsing more interesting and mysterious.
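
As a toy version of the "hide all faces" idea, here's a per-frame sketch using OpenCV's bundled Haar cascade. A real screen-wide filter would need to hook the compositor or a capture pipeline, which this doesn't attempt; it just shows the per-frame step.

    # Toy "hide all faces" filter for a single frame, using OpenCV's bundled
    # Haar cascade. Filenames and blur kernel size are arbitrary.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    def hide_faces(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
            # Blur each detected face region in place.
            frame[y:y+h, x:x+w] = cv2.GaussianBlur(frame[y:y+h, x:x+w], (51, 51), 0)
        return frame

    frame = cv2.imread("movie_frame.png")
    cv2.imwrite("filtered.png", hide_faces(frame))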


Imagine a new crop of QA automation tooling that's going to leverage these capabilities!

• Semantic change comparison between screenshots. Visual regression testing, where you prompt the model to ignore certain things instead of masking, and where it labels the changes with a message like "chart color changed" or "text shifted down by 3 pixels" (sketched below).

• Using plain English as test scenarios, instead of brittle WebDriver-like APIs.

• Autonomous agent fuzzing the application by free roaming the UI.

• RAG the design artifacts from Jira and Google Docs for more targeted feature exploration and test scenario generation.

• Automatic bug reports as the output of the above. Or even send a draft PR to fix an issue, while at it!

The more I think about use cases, the more it sounds like full software development automation. Late in the game, we probably won't need software as it exists today at all. This feels like reading Accelerando again, but this time it's happening for real and to you.
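
Here's a minimal sketch of the first bullet (semantic screenshot diffing), assuming the standard OpenAI vision API. The prompt wording is illustrative, and the model name is just one of those mentioned elsewhere in this thread.

    # Minimal sketch of semantic visual-regression diffing with a vision model.
    # Prompt wording is illustrative only.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def to_data_url(path: str) -> str:
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    def semantic_diff(baseline: str, current: str, ignore: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            messages=[{"role": "user", "content": [
                {"type": "text", "text":
                    "Compare these two UI screenshots and list meaningful visual "
                    f"changes as short bullet points. Ignore: {ignore}."},
                {"type": "image_url", "image_url": {"url": to_data_url(baseline)}},
                {"type": "image_url", "image_url": {"url": to_data_url(current)}},
            ]}],
        )
        return resp.choices[0].message.content

    print(semantic_diff("baseline.png", "current.png", "timestamps and ad banners"))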

P.S.: Didn't expect to see a cafe from Cyprus - Akakiko Limassol - used as the demo, I'll remember to visit next time I'm in the area :P


We were recently discussing how blind people use the computer, navigate the web, and write programs in code editors. ScreenAI and other ways of giving LLMs a visual mode are promising here, giving people the ability to understand and interact with visual interfaces using natural language.


Google claims SoTA but it appears that, according to Apple, they may already be out of date: https://arxiv.org/abs/2404.05719


The core aspects of this research, datasets, and use cases discussed here have been in progress for quite a long time at Google (it's been WIP for many many years). The same can probably be said of Apple's paper though!

Congrats to all the folks involved :)


How does this compare to the new GPT-4-turbo vision or Claude 3 Opus vision? Also, is this open source or can we access it with Vertex AI?


We haven't been able to use Claude 3 Opus vision yet because we're in Canada, but GPT-4-V works extremely well (when combined with Segment Anything). See: https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in progress).

Unfortunately, we can't compare it to ScreenAI directly since, as far as I can tell, it is not generally available. However, ScreenAI does not appear to use a separate segmentation step, which we needed to implement in order to get good results.


Update: it looks like Anthropic now accepts Canadian credit cards.

The results are not as good as GPT4-V or Gemini. I've posted the output for each of `gpt-4-vision-preview`, `gpt-4-turbo-2024-04-09`, `gemini-1.5-pro-latest`, and `claude-3-opus-20240229`. Claude is the only one who makes mistakes, at least on that test.


Can you elaborate on "extremely well"? Where is it currently falling short?


You can see how it performs in describing GUI elements in the linked PR if you scroll down -- here's a direct link:

https://private-user-images.githubusercontent.com/774615/320...

Regarding failure modes, we have yet to do extensive testing, but I have seen it confuse the divide and subtract buttons on the calculator, though only once.


Pardon my ignorance, but can I run this model locally?


I wondered that too. It's Google, so it's going to be TensorFlow or accessible through some workbook... and I've lost interest.


Is this similar to Apple's ReALM?


Is the model being released?


"We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability."

Looks useful to me for replicating some things. Good stuff!


I think the model's interpretability and privacy protections also need attention, especially in scenarios involving users' personal data. So although this research has potential, further research and exploration are still needed before practical application.




