
It seems like small multimodal LLMs have a killer use case: bundled with browsers for accessibility. Eventually, the model could step in when:

* an image doesn't have alt text

* you need the page read aloud

* you need a description of what's happening in a video

A model built into the OS or browser seems like a no-brainer.
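As a sketch of what that could look like in practice: the snippet below fills in missing alt text from a content script, assuming a hypothetical localModel.describeImage() API exposed by the browser (nothing like this is standardized today; Chrome's experimental built-in AI APIs are the closest analogue).

    // Sketch: fill in missing alt text with a local multimodal model.
    // `localModel.describeImage` is hypothetical; no such API ships today.
    declare const localModel: {
      describeImage(blob: Blob): Promise<string>;
    };

    async function fillMissingAltText(): Promise<void> {
      // `img:not([alt])` deliberately skips alt="": an empty alt marks an
      // image as decorative and should be left alone.
      const images = document.querySelectorAll<HTMLImageElement>("img:not([alt])");
      for (const img of images) {
        try {
          const blob = await (await fetch(img.currentSrc || img.src)).blob();
          // Flag the text as machine-generated so users know its provenance.
          img.alt = `(auto) ${await localModel.describeImage(blob)}`;
        } catch {
          // If fetching or describing fails, leave the image untouched.
        }
      }
    }

    void fillMissingAltText();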




Chrome already has optional built-in support for generating alt text for images. It's been there for years, using a server-based API.

It does seem possible that this could be replaced with a local model in the near future. It's not clear the average user has the hardware specs for this to be an option today, but it will become increasingly plausible.

Keep in mind, though, that alt text is just one small part of making a web site accessible.
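For illustration, here are a couple of the checks beyond alt text that a quick DOM audit might cover; a rough sketch only, not a substitute for a real tool like axe-core.

    // Rough sketch of two non-alt-text accessibility checks.
    function quickAudit(): string[] {
      const issues: string[] = [];
      // Form controls should have an accessible name.
      document
        .querySelectorAll<HTMLInputElement>("input, select, textarea")
        .forEach((el) => {
          if (el.type === "hidden") return;
          const labelled =
            el.labels?.length ||
            el.getAttribute("aria-label") ||
            el.getAttribute("aria-labelledby");
          if (!labelled) issues.push(`Unlabelled control: <${el.tagName.toLowerCase()}>`);
        });
      // Heading levels should not skip (e.g. h1 straight to h3).
      let last = 0;
      document.querySelectorAll("h1, h2, h3, h4, h5, h6").forEach((h) => {
        const level = Number(h.tagName[1]);
        if (last && level > last + 1) issues.push(`Heading skip: ${h.tagName}`);
        last = level;
      });
      return issues;
    }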


> It does seem possible that this could be replaced with a local model in the near future. It's not clear the average user has the hardware specs for this to be an option today, but it will become increasingly plausible.

Siri does something like this when reading messages into your AirPods. It will give brief descriptions of photos sent in the message. I'm pretty sure it's all run locally.


Siri has the advantage of running on either an iPhone or a MacBook. Chrome has to run on budget Android phones and Chromebooks.



Right now I use LLMs to generate alt text for images, and the results are better than anything I would have written by hand. Only in about 1% of cases do I need to correct anything.
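That kind of workflow can be a short script over a media folder. Below is a sketch assuming an Ollama-style local server with a vision model (the endpoint, the "llava" model name, and the response shape are all assumptions to check against your setup); the point is that every draft still gets a human pass.

    // Sketch: draft alt text for a local image via an assumed
    // Ollama-style server; verify endpoint and model against your setup.
    import { readFile } from "node:fs/promises";

    async function draftAltText(path: string): Promise<string> {
      const image = (await readFile(path)).toString("base64");
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "llava",
          prompt: "Write one sentence of alt text for this image.",
          images: [image],
          stream: false,
        }),
      });
      const { response } = (await res.json()) as { response: string };
      return response.trim();
    }

    // Drafts are reviewed by hand before publishing.
    draftAltText("./hero.jpg").then(console.log);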


LLM-generated descriptions miss a lot of context. For instance, depending on the site and content, we might mention people's race or fashion; other times we don't.



> A model built into the OS or browser seems like a no-brainer.

How about localizing it into all languages supported by a major OS or browser?
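One way to sidestep per-language models: keep the vision model itself language-agnostic and pass the browser's own locale into the prompt. A sketch, where localModel.describe is again hypothetical:

    // Sketch: request the description in the user's UI language.
    // `localModel.describe` is hypothetical.
    declare const localModel: {
      describe(blob: Blob, options: { language: string }): Promise<string>;
    };

    async function localizedAlt(blob: Blob): Promise<string> {
      // navigator.language reflects the browser/OS locale, e.g. "pt-BR".
      return localModel.describe(blob, { language: navigator.language });
    }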



