Looks like you missed the whole point of this dataset.
The idea that we proved is that you can get a dataset with decent captions and images (that do match, yes; you can see for yourself at https://rom1504.github.io/clip-retrieval/ ) that can be used to train well-performing models (e.g. OpenCLIP and Stable Diffusion) while using only automated filtering of a noisy source (Common Crawl).
We further proved that idea by using aesthetic prediction, NSFW and watermark tags to select the best pictures.
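For readers who want to see what that automated selection looks like in practice, here is a minimal sketch of filtering one metadata shard by those tags with pandas. The file path, column names (aesthetic, punsafe, pwatermark) and thresholds are assumptions for illustration; the exact schema varies between LAION releases.

```python
import pandas as pd

# Minimal sketch: filter one metadata shard using the automated tags.
# Column names and thresholds are assumptions; adjust them to the schema
# of the shard you actually download.
df = pd.read_parquet("laion_metadata_shard_0000.parquet")

filtered = df[
    (df["aesthetic"] >= 6.0)      # keep high aesthetic-prediction scores
    & (df["punsafe"] < 0.1)       # drop likely-NSFW samples
    & (df["pwatermark"] < 0.5)    # drop likely-watermarked samples
]

filtered.to_parquet("laion_filtered_shard_0000.parquet")
```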
Is it possible to write captions manually? Sure, but that doesn't scale much and won't make it possible to train general models.
It's possible to do a much better job with automation too. Context-aware cropping, accurate aspect ratio, quality filtering by various metrics... all solved problems long ago, but absent from LAION-5B for some reason. Perhaps it would be a good idea to collaborate more closely with image experts for the next round.
> Is it possible to write captions manually? Sure, but that doesn't scale much and won't make it possible to train general models.
Maybe, though I don't think so, based on the above comments by Unstable Diffusion. It seems like people are underestimating the power of high-quality data and just throwing the kitchen sink at models. Perhaps a set of good-quality data can indeed outperform LAION-style datasets.
It's like the YC saying about doing things that don't scale: perhaps with the high-quality dataset we can train better models than CLIP, and in turn use those to caption the rest of the images, only now the captioning model is much better than previous ones.
The new Unstable Diffusion model will be one of the several SD fine-tuned models out there. These models usually have much higher quality (but smaller image diversity) because they take the coherency of SD and constrain the distribution to a small, high-quality portion. This means you could train a model on a smaller high-quality dataset from scratch, but you would not, for example, get the same level of coherency; that can only be obtained with an incredible amount of images, and they don't need to be "high quality": a man will almost always have 2 arms, 2 legs, etc., regardless of the quality of the images. After the model has fit the entire distribution, you can fine-tune it to produce high-quality and coherent images with a small dataset. That's why Unstable Diffusion will fine-tune an SD checkpoint, and also why researchers use these big datasets like LAION-400M/5B.
> and they don't need to be "high quality", a man will almost always have 2 arms, 2 legs etc...
For the next generation, it feels like the training set will be inbreeding on the flood of Stable Diffusion images with 7 mangled fingers, heads coming out of legs, etc.
LAION-400M/5B will obviously not change (and there is enough data to train a really good model). If a future dataset has AI-generated images, these will be highly curated, since the images were chosen by the person using the model and probably further shared by other users; it would work like a complicated form of Reinforcement Learning from Human Feedback (RLHF). Plus, AI-generated images will usually have keywords such as "AI" or "Midjourney" in the caption, so the model can learn to distinguish them from the rest of the dataset (and CFG comes to the rescue when the dataset is noisy).
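To make the CFG remark concrete, here is a rough sketch of how caption keywords plus classifier-free guidance could be used at inference time, using the diffusers library. The model id, prompt, and negative prompt are only illustrative, and this is one possible use of CFG rather than a prescribed recipe.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch: if AI-generated training images carry keywords like "AI" or
# "Midjourney" in their captions, classifier-free guidance can push
# generations away from that part of the caption space at inference time.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a man walking on a beach, photograph",
    negative_prompt="AI, Midjourney",  # steer away from AI-tagged captions
    guidance_scale=7.5,                # CFG strength
).images[0]
image.save("beach.png")
```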
I'd guess there is a bias-variance tradeoff. If you just want to make a certain kind of image, no doubt a manually labeled and curated dataset can be better. If you want a generic generative model that has learned a wide variety of stuff, scale wins.
I can see LAION playing a similar role to ImageNet. The main application of ImageNet isn't directly training image recognition models. It's pretraining on diverse data so that a "big" (big in 2016) model can be fine-tuned easily on a small dataset, after learning to be a good feature extractor. From that perspective, the label quality (and concerns about bias and whatnot) is almost irrelevant.
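As a sketch of that pretrain-then-fine-tune pattern (the number of classes, learning rate, and data loader here are made up for illustration):

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained backbone as a feature extractor and only
# train a new classification head on a small dataset.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False                         # freeze pretrained features

backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop over the small labeled dataset would go here, e.g.:
# for images, labels in small_dataset_loader:
#     loss = criterion(backbone(images), labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```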
Thanks to approximate knn, it's possible to query and explore the 5B dataset with only 2TB of local storage; anyone can download the knn index and metadata to run that locally too.
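The core trick is an approximate nearest-neighbour index over CLIP embeddings. Here is a small sketch of the idea with faiss; the embedding dimension, file names, and index string are assumptions rather than the exact artifacts the clip-retrieval project ships.

```python
import numpy as np
import faiss

# Sketch of approximate kNN over CLIP embeddings: a compressed IVF+PQ index
# is what makes billions of vectors searchable from modest local storage.
d = 768                                        # example CLIP embedding size
embeddings = np.load("clip_image_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)                 # cosine similarity via inner product

index = faiss.index_factory(d, "IVF4096,PQ64", faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)                        # learn coarse centroids + PQ codes
index.add(embeddings)

query = embeddings[:1]                         # stand-in for a CLIP text embedding
distances, ids = index.search(query, 10)       # approximate 10 nearest neighbours
print(ids[0])
```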
Regarding duplicates, indeed it's an interesting topic!
LAION-5B deduplicated samples by url+text, but not by image.
To deduplicate by image, you need an efficient way to compute whether images a and b are the same.
One idea is to compute a hash based on CLIP embeddings. A further idea would be to train a network actually good at dedup, and not only similarity, by training on positive and negative pairs, e.g. with a triplet loss.
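A minimal sketch of the embedding-hash idea (the sign-based binarization is just one possible choice here, not an established LAION pipeline):

```python
import hashlib
import numpy as np

# Binarize each normalized CLIP embedding by the sign of its components and
# hash the resulting bit string; near-identical images tend to collide.
def clip_embedding_hash(embedding: np.ndarray) -> str:
    embedding = embedding / np.linalg.norm(embedding)
    bits = (embedding > 0).astype(np.uint8)          # 1 bit per dimension
    return hashlib.sha256(np.packbits(bits).tobytes()).hexdigest()

def deduplicate(embeddings: np.ndarray) -> list[int]:
    seen, keep = set(), []
    for i, emb in enumerate(embeddings):
        h = clip_embedding_hash(emb)
        if h not in seen:                            # keep one image per bucket
            seen.add(h)
            keep.append(i)
    return keep
```

A network trained with a triplet loss on known duplicate/non-duplicate pairs would simply replace the raw CLIP embedding fed into the hash above.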
You are probably very aware of it, but just to highlight the importance of this for people who aren't: data duplication degrades training and makes memorization (and therefore plagiarism, in the technical sense) more likely. For language models, this includes near-duplicates, which I'd guess would extend to images.
Using CLIP for searching is better than direct text indexing for a variety of reasons, but here, for example, because it better matches what Stable Diffusion sees.
Still interesting to have a different view over the dataset!
If you want to scale this out, you could use Elasticsearch.
Let me start by saying that LAION is a non-profit, open to anyone who wants to contribute.
Agreed about the website css. Do you want to contribute?
What's the problem with the dataset name exactly? Seems to work pretty well.
Yes, the dataset is an extract of Common Crawl; this is a method accessible to all for producing a valuable dataset. This is unlike supervised datasets, which are reserved for organizations with millions of dollars to spend on annotation, and which do not scale.
Non-annotated datasets are the basis of self-supervised learning, which is the future of machine learning. Image/text pairs with no human labels are a feature, not a bug. We provide safety tags for safety concerns and watermark tags to improve generations.
It also so happens that this dataset collection method has been proven by using LAION-400M to reproduce the CLIP model (and by a bunch of other models trained on it).
That's interesting indeed!
Note that the description search as well as the image search both use a knn index on embeddings, not exact search. That helps for finding semantically close items, but indeed, for exact reference matches it might not be the best solution.
Re-indexing the dataset with something like Elasticsearch would give the reference search results you expect.
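A small sketch of what that re-indexing could look like with the official Elasticsearch Python client; the host, index name, and field names are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

# Index caption/url metadata so exact or keyword reference searches work
# alongside the semantic knn index.
es = Elasticsearch("http://localhost:9200")

es.index(
    index="laion-metadata",
    document={"url": "https://example.com/cat.jpg",
              "caption": "a cat sitting on a windowsill"},
)

# Keyword/phrase search over captions:
results = es.search(
    index="laion-metadata",
    query={"match_phrase": {"caption": "cat sitting"}},
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["url"])
```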