
I've been frustrated by the lack of analysis of vision LLMs generally.

They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.

There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?




Why not use them yourself if you have access? I have been using Claude 3.5 Sonnet for gardening recently, and while it's not perfect (and can be a little blind unless you tell it to focus on a specific thing), it's helped me understand how to keep my plants alive in some challenging conditions (for me; this is my second or third attempt at gardening so it's all challenging lol). But just experiment with it and see where the capabilities lie. I do agree that certain classes of visual data are challenging for it.


I've used them a bunch. I want to learn from other people's experiences as well.

Some of my notes so far:

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways

- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ - talks about video inputs to Gemini Pro (which are actually image inputs: it splits the video up into one frame per second)


Anthropic have some interesting cookbook examples that provide advice on using their multimodal models here: https://github.com/anthropics/anthropic-cookbook/tree/main/m...

I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/



