I think the image encoder from CLIP (even the smallest variant, ViT-B/32) captures enough semantic information to allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata like the local directory and date-time to augment the NL query and re-rank the results.
I work on such a tool[0] for end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!
I keep forgetting to add a benchmark on a standard Flickr30k-like dataset!
But a ballpark figure should be about 100 ms per image on a quad-core CPU. I also generate an ETA during indexing and surface some meta-information to make it easy to see what data is being indexed.
It is quite possible the B variant is not enough for some scenarios. An earlier version also included video search; the frames used for indexing were sometimes blurry (lacking fine details), and those frames would generally score higher for naive natural-language queries. I only tested with the B variant.
But I resolved that problem up to a point by adding a linear layer trained to discard such frames, which was less costly than running a bigger variant for my use case.
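Roughly, such a filter could look like this (a minimal sketch, assuming ViT-B/32's 512-d image embeddings and a small hand-labelled set of sharp vs. blurry frames; not the actual code from the tool):

    import torch
    import torch.nn as nn

    # Hypothetical sketch: a single linear head on top of frozen CLIP image
    # embeddings (512-d for ViT-B/32) that scores whether a frame is worth indexing.
    class FrameFilter(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.head = nn.Linear(dim, 1)   # one logit: keep vs. discard

        def forward(self, emb):             # emb: (batch, dim) CLIP image embeddings
            return self.head(emb).squeeze(-1)

    def train_filter(embs, labels, epochs=200, lr=1e-3):
        # embs: (N, 512) frame embeddings; labels: (N,) with 1 = keep, 0 = discard
        model = FrameFilter(embs.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(embs), labels.float())
            loss.backward()
            opt.step()
        return model

    # At indexing time, keep a frame only if its score clears a threshold:
    # keep = torch.sigmoid(filter_model(frame_emb)) > 0.5

A single linear head on frozen embeddings trains in seconds on a CPU, which is presumably why it can work out cheaper than switching to a larger CLIP variant.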
Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while and anecdotally seems to work really well - it indexes fast and results are of similar quality to the giant SaaS providers.
> has been using CLIP models for smart search for a while and anecdotally seems to work really well [..] results are of similar quality to the giant SaaS providers
I'm not super familiar with what the results from the "giant SaaS providers" look like, but the demo instance of Immich doesn't seem to do it very well.
Searching for "airplane" on the demo instance, even the fourth result ranks higher than photos of actual airplanes, and most of the results aren't actually airplanes at all.
Again, not sure how that compares with other providers, but on Google Photos (as one example I am familiar with), searching for "airplane" shows me photos taken of airplanes, or photos taken from inside an airplane. Even Lego airplanes seem to show up correctly, and none of the photos are incorrect as far as I can tell.
I’ve just tried that and it’s true, although on my instance searching ‘airplane’ gives good results. I wonder if it’s due to an insufficient number of images in the demo? I also took the advice in the forums and tweaked the exact model version used.
Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through 3 embeddings (LLaVA internal, text, MiniLM), could you use a not-last layer of LLaVA as your vector? It would probably require a bit of fine-tuning, though.
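For reference, grabbing a not-last layer could look roughly like this (as far as I understand, LLaVA-1.5's vision tower is CLIP ViT-L/14-336 and it takes features from the second-to-last layer; the checkpoint name and mean-pooling here are assumptions, not a drop-in replacement for the post's pipeline):

    import torch
    from PIL import Image
    from transformers import CLIPVisionModel, CLIPImageProcessor

    # Pull the same penultimate-layer features that LLaVA's vision tower uses.
    name = "openai/clip-vit-large-patch14-336"
    model = CLIPVisionModel.from_pretrained(name)
    processor = CLIPImageProcessor.from_pretrained(name)

    image = Image.open("photo.jpg")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[-2]: (1, 1 + num_patches, 1024); drop the CLS token and
    # mean-pool the patch features into a single candidate index vector.
    penultimate = out.hidden_states[-2][:, 1:, :]
    vector = penultimate.mean(dim=1)

The catch is that, unlike CLIP's projected output, these features aren't aligned with any text encoder, which is presumably where the fine-tuning would come in.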
Is anyone aware of a model that is trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag those that are worth developing more. Would be nice to make a short list.
So I think some sort of hybrid between object recognition (like what's being discussed here as part of the workflow) and standard image-processing metrics could be helpful there. E.g. it's not absolute sharpness you're looking for; it's the subject being sharp (and possibly sharper than in other photos of the same subject from the same time period).
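A rough sketch of the image-processing half (variance of the Laplacian as the focus measure, with the subject box coming from whatever detector is already in the workflow; purely illustrative):

    import cv2

    # Variance of the Laplacian is a standard focus measure; computing it on the
    # subject's crop rather than the whole frame avoids penalising photos with an
    # intentionally blurred background. box = (x1, y1, x2, y2) from the detector.
    def sharpness(gray):
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    def subject_sharpness(image_path, box):
        gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
        x1, y1, x2, y2 = box
        return sharpness(gray[y1:y2, x1:x2]), sharpness(gray)  # subject vs. whole frame

The scores are only meaningful relative to other shots of the same subject from the same session, which matches the "sharper than other photos from the same time period" point.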
For indexing images, it is probably more convenient to directly calculate the embeddings using the CLIP image encoder and retrieve them using the CLIP text encoder.
Going through an LLM may improve the results, though. From my experience working with Stable Diffusion 1.*, CLIP is not very intelligent, and a 7B quantised LLM could help a lot.
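For what it's worth, the direct approach looks roughly like this (a minimal sketch; the ViT-B/32 checkpoint via Hugging Face Transformers is chosen only as an example):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Embed images once with the CLIP image encoder, embed the query with the
    # CLIP text encoder, rank by cosine similarity.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_images(paths):
        images = [Image.open(p) for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def search(query, image_feats, paths, top_k=5):
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = (image_feats @ q.T).squeeze(-1)
        best = scores.topk(min(top_k, len(paths)))
        return [(paths[i], scores[i].item()) for i in best.indices]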
I second this. CLIP, BLIP, etc. alone are lightweight but pretty dumb for captioning in the grand scheme of things.
CLIP is reasonable for reverse image search via embeddings, but many of the models in this class don't work very well for captioning because they're trained on COCO, etc., and are pretty generic.
But this specific use case then extracts an embedding from the caption, which is where CLIP would skip a lot of overhead by going from the image to the embedding directly.
If you were solely doing reverse image search (submit image, generate embeddings, vector search), yes.
This is LLaVA -> text output -> sentence embedding -> (RAG-style-ish) search over the sentence-embedding output, with the query text passed back through the same sentence-embedding model.
You could skip the LLaVA step and use CLIP/BLIP-ish caption output -> sentence embedding, but pure caption/classification-model text output is pretty terrible by comparison. It's not only inaccurate; it also carries little to no semantic context and is extremely short, so the sentence-embedding models have poor-quality input and not much to go on even when the caption/classification is decently accurate.
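Roughly, the caption-to-embedding leg of that pipeline could look like this (a minimal sketch; the MiniLM model name and toy inputs are placeholders, not necessarily what the post uses):

    from sentence_transformers import SentenceTransformer, util

    # Assumed inputs: paths[i] is an image path and captions[i] is its LLaVA caption.
    paths = ["beach.jpg", "birthday.jpg"]
    captions = ["A child building a sandcastle on a sunny beach next to a red bucket.",
                "People gathered around a birthday cake with lit candles in a kitchen."]

    # Embed the captions once at indexing time.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    caption_embs = model.encode(captions, convert_to_tensor=True, normalize_embeddings=True)

    # At query time, embed the query text and do a cosine-similarity search.
    def search(query, top_k=5):
        q = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
        hits = util.semantic_search(q, caption_embs, top_k=top_k)[0]
        return [(paths[h["corpus_id"]], h["score"]) for h in hits]

    print(search("kid playing at the seaside"))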
CLIP does not generate captions; it's simply an encoder. The image and text encoders are aligned, so you don't need to generate a caption: you simply encode the image and later retrieve it using the vector created by the text encoder (the query).
I'm using CLIP here generically to refer to families/models generating captions by leveraging CLIP as the encoder - of which there are plenty on "The Hub".
Have you actually done the approach I think you're suggesting for anything more complex than "this is a yellow cat"? Not trying to be snarky, genuinely curious. I've done a few of these projects and this approach never comes close to meeting user expectations in the real world.
Nice work. I’m thinking it could be tinkered with further by incorporating location information, date and time, and even people (facial-recognition) data from the photos, and having an LLM write one “metadata text” for every photo.
This way one can query “person X traveling with Y to Norway about 7 years ago” and quickly get useful results.
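A rough sketch of what building that "metadata text" could look like (assuming a recent Pillow for the EXIF date; the place and people values are placeholders coming from your own geocoding and face-recognition steps, not PIL):

    from PIL import Image, ExifTags

    # Fold structured data into one sentence per photo, then embed that sentence
    # alongside (or instead of) the caption.
    def metadata_text(path, place=None, people=()):
        exif = Image.open(path).getexif()
        date = exif.get(ExifTags.Base.DateTime)  # "YYYY:MM:DD HH:MM:SS" if present
        parts = []
        if people:
            parts.append("Photo of " + " and ".join(people))
        if place:
            parts.append(f"taken in {place}")
        if date:
            parts.append(f"on {date}")
        return " ".join(parts) if parts else "Photo with no metadata"

    # e.g. metadata_text("img_0042.jpg", place="Norway", people=["X", "Y"])
    # -> "Photo of X and Y taken in Norway on 2017:06:12 10:30:05"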
This is pretty awesome, but I’m curious whether it can be used to “enhance” the existing iCloud search, which is great at identifying people in my photos, even kids as they age.
Slightly related: are there any good photo-management alternatives to Photoprism that leverage more recent AI/ML technologies and provide a GUI for end users?
One long-winded way could be using Lightroom for that. It finds and groups faces. Also, maybe it can save that info into the file itself (with XMP).
I'm still trying to understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object-detection models) or CLIP. Is a multimodal model doing this under the hood?
Object detection models have human-comprehensible outputs. You can feed in a picture and it'll tell you that there's a child and a cat, and it'll draw bounding boxes around them. You can pass that info into an LLM if you want.
The downside to that approach is the LLM can't tell whether the cat is standing in front of the child, sitting on the child, or being held by the child; the input just tells it there's a child and a cat and that their bounding boxes overlap.
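A sketch of that approach (the DETR checkpoint via the Hugging Face object-detection pipeline is just one common choice):

    from transformers import pipeline

    # Serialise detector output into plain text that an LLM can read.
    detector = pipeline("object-detection", model="facebook/detr-resnet-50")

    def scene_description(image_path, min_score=0.8):
        lines = []
        for det in detector(image_path):
            if det["score"] < min_score:
                continue
            b = det["box"]
            lines.append(f'{det["label"]} at ({b["xmin"]}, {b["ymin"]})-({b["xmax"]}, {b["ymax"]})')
        return "Objects detected: " + "; ".join(lines)

    # The LLM only ever sees text like "person at (..)-(..); cat at (..)-(..)",
    # which is exactly why it can't tell whether the cat is on the child's lap.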
In contrast, LLaVA feeds the image into a visual encoder called CLIP, which doesn't output anything human-comprehensible - it just produces a bunch of numbers that have something to do with the contents of the image. But those numbers can be fed into the LLM along with text - and the image encoder and the LLM can be trained together.
If the training works right, and there is enough training data for the model to figure out the difference between a cat sitting on a lap and one being held, you end up with a model that can figure out that the child is holding the cat.
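A sketch of what using such a model looks like in practice (the llava-hf checkpoint and prompt format here are assumptions based on the Hugging Face port):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # The image goes through the CLIP vision tower and a projection layer; the
    # resulting vectors are interleaved with the text tokens at the <image> marker.
    image = Image.open("child_and_cat.jpg")
    prompt = "USER: <image>\nIs the child holding the cat?\nASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=50)
    print(processor.decode(out[0], skip_special_tokens=True))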
[0] https://github.com/eagledot/hachi