I have been playing with MetaCLIP this afternoon and made https://github.com/autodistill/autodistill-metaclip as a pip-installable version. The Facebook repository has some guidance, but you have to pull the weights yourself, save them, etc.
My inference function (model.predict("image.png")) returns an sv.Classifications object that you can load into supervision for processing (e.g. getting the top k) [1].
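Roughly, usage looks like this (a minimal sketch; the ontology import and exact class name are written from memory, so double-check against the repo in [1]):

```python
import supervision as sv
from autodistill.detection import CaptionOntology  # assumed import path
from autodistill_metaclip import MetaCLIP          # assumed class name

# Map prompts to the class names you want back (labels are illustrative).
base_model = MetaCLIP(
    ontology=CaptionOntology({"a photo of a dog": "dog",
                              "a photo of a cat": "cat"})
)

results: sv.Classifications = base_model.predict("image.png")

# supervision's Classifications exposes a top-k helper.
class_ids, confidences = results.get_top_k(k=1)
print(class_ids, confidences)
```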
The paper [2] notes the following in terms of performance:
> In Table 4, we observe that MetaCLIP outperforms OpenAI CLIP on ImageNet and average accuracy across 26 tasks, for 3 model scales. With 400 million training data points on ViT-B/32, MetaCLIP outperforms CLIP by +2.1% on ImageNet and by +1.6% on average. On ViT-B/16, MetaCLIP outperforms CLIP by +2.5% on ImageNet and by +1.5% on average. On ViT-L/14, MetaCLIP outperforms CLIP by +0.7% on ImageNet and by +1.4% on average across the 26 tasks.
CLIP is such a nice paradigm shift. Historically, computer vision was quite limited:
- You could predict a class (from a static list such as [dog, cat, ...]), or ...
- You could use image embeddings disconnected from text (you could find image look-alikes, but not what they actually represent). By "embedding" text and images in the same latent space, you can now query your images with a text query (such as "a large dog") and find the relevant photos. CLIP understands semantics, and it is also not limited to a fixed list of classes (thanks to its ability to use web data in training). See the sketch below.
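To make the shared latent space concrete, here is a minimal sketch using the Hugging Face transformers CLIP implementation (the model name and prompts are just examples):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
prompts = ["a large dog", "a small cat", "a city skyline"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into a probability distribution over the prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```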
Very exciting. CLIP, and latent-space embeddings in general, are such an intuitive and powerful tool.
I'm using it in some hobby projects, from semantic image search in private collections to trading card recognition among tens of thousands of cards.
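The search part is surprisingly little code. A sketch of how I'd do it with the transformers CLIP implementation (in practice you would cache the collection embeddings to disk rather than recompute them every run):

```python
import glob
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed the collection once.
paths = glob.glob("collection/*.jpg")
images = [Image.open(p) for p in paths]
with torch.no_grad():
    image_features = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    )
image_features /= image_features.norm(dim=-1, keepdim=True)

# Embed the text query and rank by cosine similarity.
query = "a large dog on a beach"
with torch.no_grad():
    text_features = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True)
    )
text_features /= text_features.norm(dim=-1, keepdim=True)

scores = (image_features @ text_features.T).squeeze(1)
for idx in scores.argsort(descending=True)[:5].tolist():
    print(paths[idx], float(scores[idx]))
```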
Love to see more open source work from big players on this.
No discussion after 4 hours of existence, wondering if this is leaving people speechless or not... ;)
CLIP is a very interesting development in AI these days, so demystifying it is a great idea. Is anyone here using CLIP or similar models daily, finding this research useful -- and willing to discuss it? I'm curious what you're doing.
We have ported CLIP to our Bottlenose camera. The results are very exciting and the possibilities are, for lack of a better term, endless. You can now tell the camera what to look for. For example, in manufacturing automation where the task is to detect whether any product is missing a label, our customers can use the natural-language inputs "unlabelled product" and "labelled product". The system can now differentiate between the two and send results to a PLC. Previously this would have required a new machine learning loop to deploy.
We are generating embeddings on the camera and sending them out via chunk data on the GigE Vision 2.1 protocol.
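On the receiving side, deciding between the two prompts comes down to a cosine-similarity comparison. A simplified sketch, not our production code: it assumes the camera streams a float embedding vector in the chunk data, produced by an image encoder that matches the text encoder used below.

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["unlabelled product", "labelled product"]
with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=prompts, return_tensors="pt", padding=True)
    )
text_emb = (text_emb / text_emb.norm(dim=-1, keepdim=True)).numpy()

def classify(image_embedding: np.ndarray) -> str:
    """image_embedding: the vector received from the camera's chunk data
    (must come from the matching CLIP image encoder)."""
    v = image_embedding / np.linalg.norm(image_embedding)
    scores = text_emb @ v                 # cosine similarity per prompt
    return prompts[int(scores.argmax())]
```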
I work for a computer vision company. I use CLIP almost every day. Example use cases for which I have used CLIP:
- Image classification
- Automated labeling for classification models
- Image clustering
- Gathering images for model training that are sufficiently dissimilar from existing samples (see the sketch after this list)
- Content moderation
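For the dissimilarity filtering mentioned above, the core is just thresholding cosine similarity against the existing set. A rough sketch, assuming precomputed CLIP image embeddings and a threshold you would tune per dataset:

```python
import numpy as np

def select_dissimilar(candidates: np.ndarray, existing: np.ndarray,
                      threshold: float = 0.9) -> np.ndarray:
    """Return indices of candidate embeddings whose max cosine similarity
    to any existing sample is below `threshold`.

    candidates: (N, D) CLIP image embeddings of new images
    existing:   (M, D) CLIP image embeddings already in the training set
    """
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    e = existing / np.linalg.norm(existing, axis=1, keepdims=True)
    max_sim = (c @ e.T).max(axis=1)   # best match per candidate
    return np.where(max_sim < threshold)[0]
```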
CLIP is also being used widely in new research. SAM-CLIP, shared with me by my manager today, uses CLIP (https://arxiv.org/abs/2310.15308) and knowledge distillation to train a new model. I have seen references to CLIP throughout multimodal LLM papers, too, although my knowledge of multimodal model architectures is nascent.
I have a noob-to-CLIP question. When I've tried to use it to auto-caption photos or things, the result has been like 4-5 words that may really vaguely describe the image, but honestly it's usually like "A woman holding a pencil" and sometimes "A woman woman holding a pencil" or just "A woman".
Do different models do better or worse at this? Is this just untuned outputs? Like are there parameters I should be tweaking? Sorry I'm not able to give too much more detail. I'm mostly using it within A1111's "Interrogate CLIP" option but I've tried using a model I found on replicate as well as installing locally. Same results every time.
It seems vaguely useful but like it misses the mark a lot of the time. I'm assuming I'm doing something wrong.
For a task like that, I'd recommend LLaVA instead. It's still inaccurate, but it's a great deal more accurate than the other options I've tried. It also works with llama.cpp.
LLaVA is a multimodal language model you ask questions of. If you don't provide a question, then the default is "Describe this picture in detail". But if you have a concrete question, you're likely to get better results. You can also specify the output format, which often works.
(Make sure to use --temp 0.1, the default is far too high.)
It runs very slowly on CPU, but will eventually give you an answer. If you have more than about four or five pictures to caption, you probably want to put as many of the layers as possible on the GPU. This requires specific compilation options for CUDA; on an M1/M2 it's possible by default, but still needs to be turned on (-ngl 9999).
iirc "Interrogate CLIP" is a bit of a misnomer - what it's actually doing is generating a basic caption with BLIP ("a woman holding a pencil"), then iterating over categories and checking with CLIP if any items in those categories are depicted in that image, then concatenating any hits to the resulting caption.
This means the resulting caption is of the form "[BLIP caption], [category1 item], [category2 item], ...". It's very rudimentary.
To clarify: CLIP can tell you if a text label matches an image. It can't generate a caption by itself.
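Schematically, the pipeline is something like the following. This is a hedged sketch, not A1111's actual code; the model choices, category lists, and cutoff are illustrative:

```python
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

image = Image.open("photo.jpg")

# 1. Base caption from BLIP.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip.generate(**blip_proc(image, return_tensors="pt"))
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

# 2. For each category, ask CLIP which item best matches the image and
#    append it to the caption if the score clears a threshold.
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
categories = {"medium": ["oil painting", "photograph", "pencil sketch"],
              "style": ["impressionism", "art nouveau", "minimalism"]}

parts = [caption]
for items in categories.values():
    inputs = clip_proc(text=items, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    if probs[best] > 0.5:              # arbitrary cutoff
        parts.append(items[best])

print(", ".join(parts))                # "[BLIP caption], [category hits...]"
```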
There are more advanced captioning methods, but I'm not sure if they're exposed in A1111 (I haven't used it in some months)
Thank you for this! I've always been confused about BLIP vs CLIP. That makes a lot of sense and explains the weird duplication of a noun I see sometimes -- "A woman woman", things like that.
Very cool. I used CLIP and VQGAN in a grad school project ~2 years ago, when StyleGAN, StyleCLIP, and similar projects were emerging for controlled image manipulation.
I found CLIP to be _amazing_ for all kinds of image search, like search-by-text or search-by-image. I even ported it to NumPy to understand it better. The whole thing is less than 500 lines of Python (including blank lines and comments): https://github.com/99991/NumPyCLIP
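Once you have the embeddings, search-by-image is essentially one matrix multiply. A small NumPy sketch over precomputed, L2-normalized embeddings (the file names are placeholders for however you stored them):

```python
import numpy as np

# Precomputed CLIP image embeddings, one row per image, L2-normalized.
embeddings = np.load("collection_embeddings.npy")         # shape (N, D)
paths = open("collection_paths.txt").read().splitlines()  # N file paths

def search_by_image(query_embedding, k=5):
    """Return the k most similar images to a query image embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = embeddings @ q                                # cosine similarities
    top = np.argsort(-scores)[:k]
    return [(paths[i], float(scores[i])) for i in top]
```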
https://blog.roboflow.com/openai-clip/ is a good high-level guide to what CLIP can do. (Disclosure: I work at Roboflow, but I did not author this piece.)
Love that there is a more accurate pre-trained CLIP model, as CLIP is a foundation for Stable Diffusion and many other very important open-source models.
But I would say that the main issue with CLIP is not performance, but that textual input is limited to 77 tokens (the text encoder's context length).
This is a severe limitation; if Meta or another company collected a dataset that allowed a model with a 1024-token context instead, it would enrich the world of open-source models much more than +2% accuracy.
My hope is that the next person or company who works on this will invest in a longer context size for text input :fingers_crossed:
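To make the limit concrete, here is a quick check with the Hugging Face tokenizer for the original OpenAI weights (the model name is just an example; most CLIP variants are configured the same way):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tokenizer.model_max_length)   # 77 -- the text encoder's context length

long_caption = "a very detailed description of a scene " * 20
tokens = tokenizer(long_caption, truncation=True,
                   max_length=tokenizer.model_max_length)
# Anything past 77 tokens is simply cut off before it reaches the text encoder.
print(len(tokens["input_ids"]))     # 77
```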
[1] https://github.com/autodistill/autodistill-metaclip
[2] https://arxiv.org/pdf/2309.16671.pdf