I've been frustrated at the lack of analysis of vision LLMs generally....

r2_pilot · 2024-07-10T17:43:22.000000Z

Why not use them yourself if you have access? I have been using Claude 3.5 Sonnet for gardening recently, and while it's not perfect (and can be a little blind unless you tell it to focus on a specific thing), it's helped me understand how to keep my plants alive in some challenging conditions (for me; this is my second or third attempt at gardening so it's all challenging lol). But just experiment with it and see where the capabilities lie. I do agree that certain classes of visual data are challenging for it.

simonw · 2024-07-10T18:12:37.000000Z

I've used them a bunch. I want to learn from other people's experiences as well.

Some of my notes so far:

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways

- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ talks about video inputs to Gemini Pro (which are actually image inputs: it splits the video into one frame per second)
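
To make the one-frame-per-second point concrete, here's a rough sketch of which frames such a sampler would pick from a clip. This is an illustration of the idea only, not Gemini's actual pipeline: the function name `frames_to_sample` and its parameters are hypothetical, and it assumes a constant frame rate.

```python
def frames_to_sample(total_frames: int, fps: float, every_seconds: float = 1.0) -> list[int]:
    """Return the indices of the frames shown at t = 0s, 1s, 2s, ...

    Hypothetical helper: assumes a constant frame rate and says nothing
    about how any particular model really selects its frames.
    """
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []

    indices = []
    n = 0
    while True:
        # Frame on screen at the n-th sampling instant (floor of time * fps).
        index = int(n * every_seconds * fps)
        if index >= total_frames:
            break
        indices.append(index)
        n += 1
    return indices


if __name__ == "__main__":
    # A 3-second clip at 30 fps has 90 frames; sampling once per second keeps 3.
    print(frames_to_sample(total_frames=90, fps=30))
```

The practical upshot is that the model sees roughly one still image per second of footage, so anything that happens between two sampled frames is invisible to it.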

simonw · 2024-07-10T18:29:19.000000Z

Anthropic have some interesting cookbook examples that provide advice on using their multimodal models here: https://github.com/anthropics/anthropic-cookbook/tree/main/m...

I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/