There are other datasets being developed that use high-quality images manually labeled by humans, such as the one by Unstable Diffusion, which is running a Kickstarter right now. They say the higher quality images and captions will let them train a much better model, so we'll see. They also want to make the model and code fully open source, rather than using the license Stable Diffusion has, which is not open source (it places many restrictions, enforceable or not, on the images produced).
The idea that we proved is that you can get a dataset with decent captions and images (that do match, yes; you can see for yourself at https://rom1504.github.io/clip-retrieval/ ) which can be used to train well-performing models (e.g. OpenCLIP and Stable Diffusion) while using only automated filtering of a noisy source (Common Crawl).
We further proved that idea by using aesthetic prediction, NSFW and watermark tags to select the best pictures.
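To illustrate the automated-filtering idea, here is a minimal sketch of the kind of check involved (this is not LAION's actual pipeline; it assumes the Hugging Face transformers CLIP API, and the 0.3 threshold is just an example value): keep only image/alt-text pairs whose CLIP similarity clears a cutoff.

    # Minimal sketch of CLIP-based image/caption filtering (illustrative, not LAION's code).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def keep_pair(image_path: str, alt_text: str, threshold: float = 0.3) -> bool:
        """Keep an (image, alt-text) pair only if CLIP thinks they match."""
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        similarity = (image_emb @ text_emb.T).item()
        return similarity >= threshold  # example cutoff, not LAION's exact value

    # keep_pair("photo.jpg", "a dog playing in the snow")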
Is it possible to write captions manually? Sure, but that doesn't scale well and won't make it possible to train general models.
Maybe. I don't think so, however, based on the above comments by Unstable Diffusion. It seems like people are underestimating the power of high-quality data and just throwing the kitchen sink at models. Perhaps a set of good-quality data can indeed outperform LAION-style datasets.
It's like the YC saying about doing things that don't scale: perhaps with the high-quality dataset we can train better models than CLIP, and in turn use those to caption the rest of the images, only now the captioning model is much better than previous ones.
At the next generation it feels like the training set will be inbreeding on the flood of stable diffusion images with 7 mangled fingers, heads coming out of legs, etc.
I can see LAION playing a similar role to ImageNet. The main application of ImageNet isn't directly training image recognition models; it's pretraining on diverse data so that a "big" (big in 2016) model can be fine-tuned easily on a small dataset, after learning to be a good feature extractor. From that perspective, the label quality (and concerns about bias and whatnot) is almost irrelevant.
Even if a million people were labeling images, with no overlap at all, 5 billion images would mean each of them has to label 5,000 images.
What the Unstable Diffusion folks seem to be doing is using a few thousand labeled images to train a caption generation model and then using it to create a huge multimodal dataset of text and high-quality images.
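As a rough sketch of that bootstrap step (a pretrained BLIP captioner from Hugging Face transformers stands in here for whatever model they actually train; the folder name is hypothetical), you run the captioner over the large unlabeled collection:

    # Illustrative sketch: caption a folder of images with a pretrained captioner.
    from pathlib import Path
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def caption_image(path: Path) -> str:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(out[0], skip_special_tokens=True)

    for path in Path("unlabeled_images").glob("*.jpg"):  # hypothetical folder
        print(path.name, "->", caption_image(path))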
I never claimed this either.
Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4
This is not the only example of well curated image-tag pairs, especially in artistic circles. It's just that most of them are not CC.
Danbooru is only one example of well-curated tagging, and if we ignore copyright there are far more. These examples just serve as evidence that refining poor labeling is not outside the realm of possibility, as you suggested.
For example, an image is tagged kanna_kamui, kimono and torhu_(maiddragon): who has the kimono? Kanna, Torhu, or both? It cannot be known from the tags, but with natural language it is possible to describe who is wearing what.
Edit: an experiment comparing tags/BOW vs natural language sequences in image generation tasks would be interesting to see.
I think you mean "the developers of this technology do not want to pay to have hundreds of millions of images labeled".
Unstable Diffusion is also doing their captioning the way I mentioned, with groups of volunteers as well as hired individuals.
They seem to have created a much smaller dataset than LAION's; it would not work to train a generative model on such a small number of images (obviously, the images here are not limited to a single domain).
1. From a reproducibility perspective, isn't this kinda brittle in that even without malicious intent, some of those images will no longer be available when other researchers attempt to download them?
2. From a resilience perspective, if your site hosts some of the images in the dataset, could you swap in another image with the correct dimensions? Could you poison or skew the model in any interesting ways?
I don't know about the legality either... does LAION check all the licenses? Or do they skirt that by only using URLs?
1) The chance that a significant percentage of the images become unavailable is low. Also, training on such a big dataset means your model generalizes well and is usually robust.
2) Again, you would need to inject adversarial/malicious images to a significant number of those links in the dataset for it to have actual impact on trained model. Again, unlikely.
For point 2, I think it's possible that for some narrow topics, some domains have a significant share of images. I think these can affect the model, which is in part why they give special attention to watermarking. Suppose instead of merely watermarking images, for every image on my large collegiate track and field website I make sure someone is wearing a garment with a visible Nike swoosh. Can I skew the model towards associating Nike with the sport? I think this kind of thing may be achievable for niche areas.
LAION-5B has enabled some really cool technologies and a lot of promising startups. This work should have been carried out responsibly.
Seems like an open-and-shut fair use claim; web indexing (not even scraping, just indexing) is not uncommon...
This seems to be legal in many countries (from what I know, the UK, EU, Japan and Singapore) due to the TDM (Text and Data Mining) exception, especially for researchers.
All these concern trolls are bad actors. All of them.
Unless you literally steal someone's work and use it / sell it as your own, all the data mining is moral and should be legal if it isn't already.
Artists creating work are not releasing it on the internet under ShareAlike licenses or any other license that openly allows derivative works or further distribution. This is literally providing a means of stealing people's work.
How is this any different from providing a listing of copyrighted movies and games and a means to download them, a la The Pirate Bay?
A quick review of their site and the paper turns up nothing that would commonly be a topic meriting such a review.
LAION can say all they want that they’re not including images in their dataset. They include a script to download those URLs into images on disk. By being a company that’s not bound to decades of university ethics regulations, they are seemingly allowed to skirt what you learn on your first day as a researcher in academia. It may be legal, but it sure is not ethical.
This defense, that they merely provide links and not images, is the thin layer of abstraction that their entire ethics case is built on top of. They give you everything needed to create massive datasets of human data without doing it for you.
For the record, that's what I've been trying to do; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating an image, the pipeline could surface the closest training images so their authors can be credited (a rough sketch of that kind of lookup is below the footnotes).
 At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...
 Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.
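For the attribution idea, here is a minimal nearest-neighbour sketch, not the parent's actual pipeline; it assumes CLIP embeddings for the training images were precomputed with the same model, and the file names (train_embeddings.npy, train_urls.json) are hypothetical:

    # Hypothetical attribution sketch: find the training images closest to a generated one.
    import json
    import numpy as np
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    train_embeddings = np.load("train_embeddings.npy")   # (N, 512), L2-normalized, precomputed
    train_urls = json.load(open("train_urls.json"))      # list of N source URLs

    def attribute(generated_path: str, top_k: int = 5):
        image = Image.open(generated_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(pixel_values=inputs["pixel_values"])[0].numpy()
        emb = emb / np.linalg.norm(emb)
        scores = train_embeddings @ emb                   # cosine similarity against the training set
        best = np.argsort(-scores)[:top_k]
        return [(train_urls[i], float(scores[i])) for i in best]

    # attribute("generated.png") -> candidate training images to credit under CC-BY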
The positive solution here was to not collect data if there was a reasonable ethical concern. This classic mindset of “anything goes as long as we create value” is highly toxic.
At the least, Stable Diffusion 2.X is better at pain points of image generation such as text legibility and hands, potentially due to having more data points.
It's more probable a finger has a finger on both sides of it than not. So the model diffuses lots of adjacent fingers.
I can make things up too!
But they’re harder to control without negative prompting.
Either way, the main point is the same: more training data gives better results.
The copyright on the actual images and text labels is far more of a problem. Generally speaking, it is extremely infringing to collect a bunch of images or captions and redistribute them. Like, getting-to-the-heart-of-ownership, if-you-cant-sue-for-this-you-own-nothing kind of infringing. However, it's when you start talking about training ML systems on the dataset that things get interesting. In the EU, there's an explicit copyright exception for data mining that would apply to, say, training DALL-E or Stable Diffusion.
But what about the US? Well, it's legal to crawl the web; and we specifically have Authors Guild v. Google, where the Authors Guild lost in court trying to keep Google from scanning large numbers of books. AI researchers have sort of just taken this to mean "training AI is fair use". This is not court-tested, but it at least jibes with some precedent, so I think it's OK to assume it's true.
However, it means absolutely nothing for the people actually using the AI, because fair use is not transitive. If I take every YouTube video review of a movie and edit them down to just the movie clips being used, and then assemble them back together... I haven't somehow made a "fair use copy" of a movie that you can just share around. I've just made the most inefficient form of copyright infringement you can do with a computer. Likewise, if I train an AI on a movie, that can be fair use, but asking it to spit the movie back out is not.
Now, keep in mind that some ML systems (such as Copilot) are very eager to reproduce their training set data. Sometimes in situations you wouldn't expect. These sorts of things are ticking time bombs for people who want to generate novel images, because the AI having trained on such a massive data set also gives you access to basically the whole data set. That's half of a US copyright infringement claim right there - the other half being substantial similarity, which basically is the "Corporate needs you to find the differences between these two pictures" meme.
The only way to keep AI from infringing copyright is to make sure it never sees anything that could potentially be under copyright.
 Strictly speaking, the EU has a separate concept of sui generis database ownership, but for this discussion we can treat it the same as copyright.
If you're wondering why phone books aren't copyrightable, the term of art to search for is "sweat of the brow".
However, this factor is treated with little weight. Practically speaking it is very difficult to imagine a reuse of a copyrighted work that does not carry some commercial benefit to someone. At the very least, not having to pay for a license is a commercial benefit of its own. Generally speaking, assume all uses are commercial and you will understand a lot of modern fair use cases.
Legally speaking, "fair use" and "derivative work" are mutually exclusive. In fact, both terms were coined at the same time when SCOTUS created the entire derivative rights regime basically out of thin air in Folsom v. Marsh. They needed a legal tool to prevent people from stealing large sections of a work, but also didn't want to allow copyright owners to abolish the 1st Amendment. Hence, they set up a set of deliberately murky legal tests to determine if a use was "fair" or not.
If you want a quick standard to gut-check against, the question you'd ask is: "is this use something that other people would ordinarily pay for?" If so, then it's infringing. If not, then it might be fair use. So you can see why making an image generator might be fair use, but its output would be infringing if you could identify an original work the AI was cribbing from. It'd be difficult to even fathom how licensing on a training set would work, given that there's no clear chain of value from a particular entry in the set to a particular model weight or output. But we can clearly identify if an AI system is regurgitating training output, or has been told to copy someone else's work and change it a little - at least after-the-fact in a court of law.
This is an incomplete analogy, but in its first year a baby will have seen 1,892,160,000 frames of data per eye, or 3,784,320,000 frames across both eyes. That baby still knows practically nothing about the world.
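Those figures appear to assume 60 frames per second; a quick check (the frame rate is an assumption, not stated above):

    fps = 60                                   # assumed frame rate
    seconds_per_year = 60 * 60 * 24 * 365      # 31,536,000
    frames_per_eye = fps * seconds_per_year    # 1,892,160,000
    frames_both_eyes = 2 * frames_per_eye      # 3,784,320,000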
I will copy paste the main findings from the article here:
- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.
- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.
- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.
- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.
The assumption that human eyes can be measured in FPS is, in itself, very questionable. And if it were indeed the case, then it would surely be far in excess of 60 fps…
In the strictest sense, yes. But it seems quite reasonable to think that there is something like an "FPS equivalent" for the human eye. I mean, it's not magic, and physics comes into play at some level. There's a shortest unit of time / amount of change that the eye can resolve. From that you could work out something that is analogous to a frame-rate.
> And if it were indeed the case, then it would surely be far in excess of 60 fps
Not necessarily. Quite a few people believe that the human eye "FPS equivalent" is somewhere between 30-60 FPS. That's by no means universally accepted, and since it's just an analogy to begin with, the whole thing is admittedly a little bit dodgy. But by the same token, it's not immediately obvious that the human "FPS equivalent" would be "far in excess of 60 FPS" either.
Sure. Otherwise movies and video wouldn't work at all.
It would be nice to have a dataset of a couple "raising" a video recorder for a year as if it were a baby. A continuous stream of data.
Could train a model to predict the next frames based on what it's seen so far.
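As a minimal sketch of what that could look like (an illustrative PyTorch toy, not a serious video model; all names, sizes and hyperparameters here are made up), you stack the last few frames as input channels and regress the next frame:

    # Toy next-frame predictor: given the last k frames, predict frame k+1.
    import torch
    import torch.nn as nn

    class NextFramePredictor(nn.Module):
        def __init__(self, context_frames=4, channels=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(context_frames * channels, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, channels, 3, padding=1),
            )

        def forward(self, frames):  # frames: (batch, context_frames*channels, H, W)
            return self.net(frames)

    model = NextFramePredictor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    # Dummy batch standing in for a video stream: 4 context frames -> 1 target frame.
    context = torch.rand(8, 4 * 3, 64, 64)
    target = torch.rand(8, 3, 64, 64)

    opt.zero_grad()
    loss = loss_fn(model(context), target)
    loss.backward()
    opt.step()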
The project I'm working on right now is to build a sort of "body" for a (non ambulatory, totally non anthropomorphic) "baby AI" that senses the world using cameras, microphones, accelerometer/magnetometer/gyroscope sensor, temperature sensors, gps, etc. The idea is exactly to carry it around with me and "raise" it for long periods of time (a year? Sure, absolutely, in principle. But see below) and explore some ideas about how learning works in that regime.
The biggest (well, one of the biggest) challenge(s) is going to be data storage. Once I start storing audio and video the storage space required is going to ramp up quickly, and since I'm paying for this out of my own pocket I'm going to be limited in terms of how much data I can keep around. Will I be able to keep a whole year? Don't know yet.
There's also some legal and ethical stuff to work out, around times when I take the thing out in public and am therefore recording audio and video of other people.
But could still be useful to research institutes who follow privacy guidelines.
It might be best to do a short stint of 1 week to test the feasibility. That should give you a good estimate on future projections of how much data it will consume after a month, 3 months, and a year.
I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.
As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, and that would give you about 394.2 gigabytes of data for a single year of audio.
As for video... it would definitely require a lot of tricks to store on a hobbyist level.
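A quick back-of-the-envelope check of those storage numbers (the video bitrate below is my own assumption, not from the comments above):

    # Rough storage estimates for a year of continuous recording.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60            # 31,536,000

    audio_bps = 100_000                               # Opus at 100 kbps, as suggested
    audio_bytes = audio_bps / 8 * SECONDS_PER_YEAR
    print(f"audio/year: {audio_bytes / 1e9:.1f} GB")  # ~394.2 GB

    video_bps = 5_000_000                             # hypothetical 5 Mbps compressed video
    video_bytes = video_bps / 8 * SECONDS_PER_YEAR
    print(f"video/year: {video_bytes / 1e12:.1f} TB") # ~19.7 TB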
Yep. That's basically the approach I took with "phase 1" where the only data being ingested was gps / accelerometer data. I just let it run for a couple of weeks and then extrapolated out what the storage requirements would be for the future. Obviously audio and video are going to change the equation a lot, but the same principle is what I am planning to employ.
Yep, that's another area I've been thinking a lot about. The "instinct" is to capture everything at the highest possible resolution / sampling rate / etc. and store it in a totally lossless format. But that is also the most expensive scenario, and if it's not strictly required, then why do it? We know human hearing at least can work with relatively crappy audio; look at the POTS phone system and its 8 kHz sample rate, for example. Does that analogy hold for video? Good question.
Definitely. One thing that may help with costs in the short-term is that I'm very explicitly not (for now anyway) using a cloud storage service. Data ingestion is to a server I own and physically have in my home. I can get away with this because while the aggregate total amount of data may wind up fairly big over longer periods of time, the rate at which I need to ingest data isn't all that high (there's only one of these devices sending to the server). And I can just keep adding 5TB or 10TB drives as needed. When one fills up, I can unplug it, replace it with another, label and store it, and move on. The big risks here are that I don't really have any redundancy in that scenario, especially if my home burns down or something. But in that case I have bigger problems to worry about anyway!
There are other downsides to this approach, like dealing with the case of needing to access the entire year's worth of data "at once" for analysis or training, but I'm not sure that need will ever even arise.
Babies have a much harder task. They have to construct a corpus of knowledge from absolutely nothing.
GPT's ability to fool intelligent people into thinking that it is "intelligent" itself seems like a powerful argument that language, more than anything else, is what makes humans capable of higher thought. Language is all GPT has. (Well, that and a huge-ass cultural database.)
Intelligence is one of those areas in which, once you fake it well enough, you've effectively made it. Another 10x will be enough to tie the game against an average human player.
Take a baby and stick it in a room. Let it grow up with absolutely no stimulation whatsoever. They are given food and that's about it. What do you think it can demonstrate knowledge of by the time it reaches 5? 10? 15?
All behavior is learned behavior. People talk about sucking and breathing and newborn horses walking and whatnot, but babies do have to learn how to latch and how to feed. Now, they can work it out themselves, but quick acquisition of a skill does not mean the skill already existed.
Not to mention it's a far cry from sucking to language. Or knowing what a person is. Or who a person is.
Within these respective modes are even more subgroups, e.g. language translation and audio diarization. For SD you can consider animation and photographs as separate modes the model has to learn, although the language is fuzzy and I'm not being statistically rigorous, as that is a weak point of mine.
(In fact it’s exactly the same; it’s allowed under the same laws and it respects robots.txt.)
What's notable is that "AI users are trying to copy an artist" != "AI has learned from an artist" != "AI has seen the artist's images in the first place". The most popular supposedly stolen-from artist, Greg Rutkowski, is not in Stable Diffusion's training images, even though users are actively trying to copy him; it's a coincidence that it appears to work. Is that unethical?
Also, AI laws (text and data mining exemptions) /have/ been put in place - to make this explicitly legal!
> complete with watermarks and all
That's not the AI "copying their watermarks", it's the AI learning "sometimes images have watermarks" and giving them some of its own.