Is there a good explanation of how to train this from scratch with a custom dataset[0]?
I've been looking around the documentation on Huggingface, but all I could find was either how to train unconditional U-Nets[1], or how to use the pretrained Stable Diffusion model to process image prompts (which I already know how to do). Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working. I'm pretty sure I also need some other trainables at some point, too.
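For context, the training step I've been trying to write looks roughly like this - a minimal sketch assuming HuggingFace's CLIPModel/CLIPProcessor (with return_loss=True the model computes the standard symmetric contrastive loss itself, which is the part most tutorials skip). Dataloading, LR scheduling, and the rest of the SD stack (VAE, U-Net) are omitted, and no claim this is the canonical recipe:

    # Minimal sketch of a from-scratch CLIP contrastive training step with
    # HF transformers. Assumes batches of (PIL image, caption) pairs.
    import torch
    from transformers import CLIPConfig, CLIPModel, CLIPProcessor

    model = CLIPModel(CLIPConfig()).train()  # randomly initialized, not pretrained
    # The processor carries no learned weights, just tokenization/preprocessing.
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def train_step(images, captions):
        # Tokenize captions and resize/normalize images in one call.
        inputs = processor(text=captions, images=images,
                           return_tensors="pt", padding=True, truncation=True)
        # return_loss=True makes CLIPModel compute the symmetric
        # image<->text contrastive loss over the batch itself.
        loss = model(**inputs, return_loss=True).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()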
[0] Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA. This would rule out a lot of the complaints people have about living artists' work getting scraped into the machine; and probably satisfy Debian's ML guidelines.
Ah I am glad to see someone else talking about using public domain images!
Honestly it baffles me that in all this discussion, I rarely see people discussing how to do this with appropriately licensed images. There are some pretty large datasets out there of public images, and doing so might even help encourage more people to contribute to open datasets.
Also if the big ML companies HAD to use open images, they would be forced to figure out sample efficiency for these models. Which is good for the ML community! They would also be motivated to encourage the creation of larger openly licensed datasets, which would be great. I still think if we got twitter and other social media sites to add image license options, then people who want to contribute to open datasets could do so in an easy and socially contagious way. Maybe this would be a good project for mastodon contributors, since that is something we actually have control over. I'd be happy to license my photography with an open license!
It is really a wonderful idea to try to do this with open data. Maybe it won't work very well with current techniques, but that just becomes an engineering problem worth looking at (sample efficiency).
Human artists derive their inspiration and styles from a large set of copyrighted works, but they are free to produce new art despite that. Art would have developed much more slowly and been much poorer if, for example, Impressionism or Cubism had been entangled in long ownership confrontations in the courts.
Then there's the fact that humanity has been able to develop and share art and literary works for thousands of years without the modern copyright system.
It would be interesting to see if this technology can erode the copyright concept a bit. Maybe not remove it completely, but perhaps influence people to create wider definitions for "fair use", and undo the extensions that Disney lobbyists have created.
That is a very apropos reference. If you're familiar with Cubism, you know that there's Picasso, and then there's Braque. The one is an art celebrity beyond almost any other, and the other isn't.
But they developed Cubism in parallel. There were periods where their work was almost indistinguishable. "Houses at l'Estaque", the trope namer for Cubism thanks to the remarks of a critic, was in fact by Braque.
You can generate infinite recognizable Basquiat from an AI, but is it Basquiat? No, of course not, because Basquiat's style operates within the context of a specific individual human making a point about expectations and the interface between his race and his artistic boldness and audacity as experienced by his wealthy audience. Making an AI 'ape' (!) his art style is itself quite the artistic statement, but it's not the same thing in the slightest.
You can generate infinite Rothko as 512x512 squares, but if you don't understand how the gallery hangings work and their ability to fill your entire visual field with first carefully chosen color, and then a great deal of detail at the threshold of perception of distinctions between color shades meant to further drive home the reaction to the basic color's moods, what you generate is basically arbitrary and nothing. Rothko isn't 'just a random color', Rothko is about giving you a feeling through means that aren't normal or representational, and the unusualness of this (reasonably successful) effort is what gave the work its valuation.
Ownership of the experience by a particular artist isn't the point. Rothko isn't solely celebrity worship and speculation. Picasso isn't all of Cubism. Art is things other than property of particular artists.
What makes it awkward is the great ease by which AI can blindly and unhelpfully wear the mask of an artist, such as Basquiat, to the detriment of art. It's HOW you use the tools, and it's possible to abuse such tools.
> You can generate infinite recognizable Basquiat from an AI, but is it Basquiat? No, of course not, because Basquiat's style operates within the context of a specific individual human making a point about expectations and the interface between his race and his artistic boldness and audacity as experienced by his wealthy audience.
I'm not sure how I feel about this - I agree with the conclusion, but not the reasoning. For me, AI-generated Basquiat is not Basquiat simply because he had no ownership or agency in the process of its creation.
It feels like an overly romantic notion that art requires specific historical/cultural context at the moment of its creation to be valid.
If I could hypothetically pay Basquiat $100 to put his own work into a stable diffusion model that created a Basquiat-esque work, that's still a Basquiat. If I could pay him to draw a circle with a pencil, that's his work - and if I used it in an AI model, then it's not.
It's about who held the paintbrush, or who delegated holding the paintbrush, not a retrospectively applied critical theory.
On reflection, I'm going to say 'nope'. Because it's Basquiat, I'm pretty sure you couldn't get him to make a model of himself (maybe he would, and call it 'samo'?). I don't think you could pay him to draw a circle with a pencil: I think he'd have been offended and angry. And so that is not 'his work'. It trips over what makes him Basquiat, so doing these things is not Basquiat (though it's very, very Warhol).
Even more than that, you couldn't do Rothko that way: the man would be beyond offended and would not deal with you at all. But by contrast, you ABSOLUTELY are doing a Warhol if you train an AI on him and have it generate infinite works, and furthermore I think he'd be absolutely delighted at the notion, and would love exploring the unexplored conceptual space inside the neural net.
In a sense, an AI Warhol is megaWarhol, an unexplored level of Warholiness that wasn't attainable within his lifetime.
Context and intent matter. All of modern art ended up exploring these questions AS the artform itself, so boiling it down to 'did a specific person make a mark on a thing' won't work here.
This seems to me to confuse agency with interpretation - romanticising the life and character of the artist after their heyday and death, talking about what they would have done.
Any drawing Basquiat did is a piece of art by Basquiat, whether or not it fits into the narrative of a book/thesis/lecture/exhibition. The circle metaphor isn't important - replace it with anything else. Artists regularly throw their own work away. Some of this is saved and celebrated posthumously, some never sees the light of day in accordance with their wishes. Scraps that fell on Picasso's floor sell for huge amounts of money.
Does everything he did fit the "brand" that some art historians have labelled him with, or the "brand" that auction houses promote to increase value, or the "brand" which a fashion label licenses for t-shirts? No, but I suspect this is probably what you are talking about ie. a "classic" Basquiat™ with certificate of authenticity?
Human artists cannot produce thousands of works in a few hours.
This argument comes up in every thread, and I'm baffled that people don't think the scale matters.
You may also be observed in public areas by police, but it would be an Orwellian dystopia to have millions of cameras in public spaces analyzing everyone's behavior.
Scale matters.
(But I'm indeed in favor of weaker copyright laws! But preferably to take power away from the copyright monopolies than the individual artists who barely get by with their profits)
> It would be interesting to see if this technology can erode the copyright concept a bit
Copyright law (especially in US) only ever changes in the direction that suits corporations. So - no.
What I expect instead is artists being sued by a big tech company for copyright violations, because that big tech company used the artist's public domain image for training their copyrighted AI, and as a result it created a copyrighted copy of the original artist's image.
My bet is that big corporations won't risk suing anyone over a supposed copyright on generated images, as there is a good chance that a court ends up stating that all AI-generated images are in fact public domain (no author, not from the original intent and idea of a human).
You can already see the quite strange and toned-down language they use on their sites (and, for some, the revealing reversal from "we license to you" to "you license to us").
Some smaller AI companies might believe they own a clear-cut copyright and sue, but it would make sense that they would either be thrown out or lose.
So, the US Copyright Office will already refuse to issue a copyright for text-prompt-generated AI art, at least if you try a stunt like naming the artist to be the AI program itself.
However, even if an image is not copyrightable, it can still infringe copyright. For example, mechanical reproductions of images are not copyrightable in the US[0] - which is why you even can have public domain imagery on the web. However, if I scan a copyrighted image into my computer, that doesn't launder the copyright away, and I can still be sued for having that image on my website.
Likewise, if I ask an AI to give me someone else's copyrighted work[1], it will happily regurgitate its training set and do that, and that's infringement. This is separate from the question of training the AI itself; even if that is fair use[2], that does nothing for the people using the AI because fair use is not transitive. If I, say, take every YouTube video essay and review on a particular movie and just clip out and re-edit all the movie clips in those reviews, that doesn't make my re-edit fair use. You cannot "reach through" a fair use to infringe copyright.
[0] In Europe there's a concept of neighboring rights, where instead of a full copyright you get 20 years of ownership. This is intended for things like databases and the like. It also applies to images; copyright over there distinguishes between artistic photography (full copyright) and other kinds of photography (a 20-year neighboring right only). This is also why Wikimedia Commons has a hilarious amount of Italian photos from the 80s in a special PD-Italy category.
[1] Which is not too difficult to do
[2] My current guess is that it is fair use, because the AI can generate novel works if you give it novel input.
> So, the US Copyright Office will already refuse to issue a copyright for text-prompt-generated AI art, at least if you try a stunt like naming the artist to be the AI program itself.
That’s because only humans can own copyrights. People can and have registered copyrights for Midjourney outputs.
> Copyright law (especially in US) only ever changes in the direction that suits corporations. So - no.
There are certainly arguments to be made in this direction, for example that corporations tend to have the most money and can afford to spend it on lobbying to get their way. But the attitude of "it hasn't been good up 'til now so it definitely can't ever be good" is pretty defeatist and would imply that positive change is impossible in any area.
In this situation, it would seem like the suit would end up at "comparing the timestamp at which the public domain and copyrighted versions were published", wouldn't it?
There is nothing that the generative AI can do in this process that's legally different from copy-pasting the image, editing it a bit by hand, and somehow claiming intellectual property of the _initial_ image, no?
In theory yes; in practice you have to pay your legal expenses in the US even if you win the case. Which means you can go bankrupt because a big company thought you infringed on their rights even if you didn't, simply because you can't afford the costs.
Doesn't your argument in the first paragraph assume that the method by which humans derive new works from past experiences is equivalent to the way statistical models iteratively remove noise from images based on a set of abstract features derived from an input prompt?
That seems to be the core of the issue, and a much more interesting conversation to have. So why do I keep seeing a version of your first paragraph everywhere and not an explanation on why the assumption can be made?
The problem is not that people aren't owning ideas hard enough; ideas shouldn't be ownable in this way. The problem is that we've created a system that's obsessed with scarcity and collecting rents. Being able to own and trade ideas a la copyright/patents helps people who can buy copyrights and patents stifle creativity more than it helps artists gather reward for their creation (though it does both).
Human endeavor is inherently collaborative. The idea that my art is my virgin creation is an illusion perpetuated by capitalists. My art is the work of thousands who came before me with my slight additions and tweaks.
Your (and in general, our) suggestion that we should be concerned with respecting or even expanding these protections is incorrect if you want human creativity to flourish.
You misunderstand me. I am strongly in favor of abolishing all intellectual property restrictions. Here is me arguing just that two days ago:
https://news.ycombinator.com/item?id=33697341
But I am absolutely not in favor of keeping IP restrictions in place and then letting big corporations scoop up the works of small independent artists for their ML models.
Think of it in terms of software licenses. The people who write GPL protected software are leveraging existing copyright laws to enforce distribution of their code. They would probably be in favor of abolishing the entire IP rights system. But if a big corporation was copying a project from an independent creator that was GPL licensed, they’d sure as hell want to prosecute.
I believe strongly that IP restrictions are harmful. But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me. As long as artists wouldn’t expect anyone else to be able to copy their works, I’d like them to be able to consent to their work being used in these systems.
Ahh, I don't think that stance is evident from the GP but fair enough. I may even have a less fervent hate for IP protections than you do.
> But keeping them in place while letting big corporations benefit from the work of independent artists who don’t want their work used in this way seems wrong to me.
I see what you're saying here. My concern is that should copyright style protection be extended to the "vibe" or "style" of a painting it is going to be twisted in a way that ends up being used to silence/abuse artists in the same way that copyright strikes are already.
I think the idea that art is mostly individually creative vs mostly drawing upon the work of all the artists and art-appreciators around you and before you is already really problematic. The corrupting power of the idea is what I worry about. Similarly to crypto/NFTs, the idea that scarcity should exist in the digital world is the most dangerous thing, most of the other bad stems from that.
IMO the most important thing to work on is getting people to reject the idea itself as harmful.
I worry that any short term fix to try to prop up artists' rights in response to this changing landscape will become a long term anchor on our society's equity and cultural progress in the exact same way copyright is.
When I was younger, I also thought that way. I also felt that being an artist has nothing to do with money: a true artist will always create out of their internal need, not for money.
Then came the brutal reality: creating high-quality artwork needs time. Some can be created after work, but not that much. Some forms of art require expensive instruments. Some, like filmmaking, require collaboration and coordination of many people. So yes, I could do some forms of art part-time using the money from my day job, but I knew it was a far cry from what I could do when working on it full time. It's not capitalism, it's just reality.
Yeah, if you want artists to be able to devote their lives to their craft and reach the highest possible levels, they have to get paid enough to do that.
If all artists are "weekend warriors", they will still produce a lot of art, and some of it will be the best in that world. But the quality will be far from what we enjoy today.
That said, there are of course other ways to pay artists than the capitalist way of having customers pay for what they like. But I think the track record firmly favors a capitalist system.
It's almost like "capitalism" isn't something that needs to be created and forced upon people; it's just the way a world works where energy isn't free and cannot be created from thin air. Capitalism is just that, the realization that there are no free lunches and no UBIs are possible without some serious unintended consequences. I pirate everything I consume, but I would never be such a hypocrite as to say that all copyright must be abolished.
What? No. Capitalism is a more specific system for organizing goods and services, wherein the means of production and distribution of those goods and services (buildings, land, machines and other tools, vehicles etc) are privately owned and operated by workers (who are paid a wage) for the profit of the owners. That's only been the norm for a few hundred years, and only in certain places. Also, capitalism is separate from copyright and other IP, though IP as currently implemented is pretty obviously a capitalist concept.
At the moment I'd rather not get involved in an online argument about which economic systems are better than which other ones... especially not on a forum run by a startup accelerator, with a constraint that my preferred system has to be more than 300 years old.
I just wanted to point out that capitalism is in fact a specific economic system. It's not a law of nature, or another word for "markets" or "freedom", or a realization that some other system doesn't work.
That's one of the great victories of capitalism: somehow it has convinced people that a 300 year-old economic system originating in north-western Europe is as natural as the air we breathe, and as inevitable as gravity or any natural law.
You have to threaten to shoot people to get them to practice any other -ism.
So, yes, capitalism in the sense of the freedom to trade one's labor does appear to be naturally and universally emergent in advanced human societies, in the absence of violent interference.
Capitalism has violent coercion at its core, in order to enforce its property rights. You simply think that that violence is legitimate and unproblematic because you believe the system it upholds is "natural" and legitimate, but at this point you're arguing in circles. But to say that capitalism is not violent is laughable.
Yes, it is. The violence comes in when you interfere with capitalism. It's not imposed upon you forcefully, you just aren't allowed to get in the way.
To the extent that certain aspects of capitalism lead to violence, those are elements that other parties -- generally corporations or governments rather than writers or philosophers -- added to the ideology.
People die trying to break out of non-capitalist countries, while they die trying to break in to capitalist ones. That's one possible way to tell the good guys from the bad guys.
(Shrug) Taking peoples' rights away, including their economic rights, is likely to get the hurt put on you. Ric Romero has more on this late-breaking story at 11.
It sounds funny but he may have a point.
It's not a quality of capitalism per se, had it been communism instead then communism would have been the best system for the present moment.
But capitalism prevails and may be the best system there is for now because I cannot fathom a change in system overnight that would not result in mass suffering for (almost) everyone.
The restrictions on creating art are the product of the society you live in, which means they are the product of capitalism if you live in a capitalist society. The way society is organised determines the cost of people's time, the cost of the tools, and the cost of the materials.
Yea I find when people say "ideas shouldn't be ownable" it's really the more general "deriving profit from private ownership was a mistake". Like you kinda point out, most of the reason I can think of that a person would want control of their intellectual property is to derive profit from it.
That reason has nothing to do with intellectual property or how it's created, it's a consequence of living in a capitalist society.
So anybody who just wanted a thing to exist, and don't care who gets the credit, aren't "real artists"? You must not work on any large art projects that involve other people.
99%? You might have it in reverse, because most art is not produced by "fulltime" artists. I would even go as far as to say 99% of art is not produced to earn money.
I've seen many arguments about getting laws on the books around ML training. I would suggest people create a project that makes movies using ML, trained on existing Hollywood movies. I realize this isn't easy, but the issue needs to be pushed onto people who have the means to force change.
If you can't process/digest copyrighted content with algorithms/machine learning then Google Search (the whole thing, not just Image Search) is dead.
So no, it's not at all clear where the legal lines are drawn. There have been no court cases yet, regarding the training of ML models. People are trying to draw analogies from other types of cases, but this has not been tried in court yet. And then the answer will likely differ based on country.
> If you can't process/digest copyrighted content with algorithms/machine learning then Google Search (the whole thing, not just Image Search) is dead.
Not if Google honors the robots.txt like they say they do. Hosting content with a robots.txt saying "index me please" is essentially an implicit contract with Google for full access to your content in return for showing up in their search results.
Hosting an image/code repository with a very specific license attached and then having that license ignored by someone who repackages that content and redistributes it is not the same as sites explicitly telling Google to index their content.
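(For what it's worth, that consent check is mechanical; Python's stdlib even ships a parser for it. A sketch of what a polite crawler does before fetching, with a hypothetical user agent name:)

    # Sketch: checking robots.txt before fetching, stdlib only.
    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # Only fetch if the site grants our user agent access to this path.
    if rp.can_fetch("MyImageCrawler", "https://example.com/images/photo1.jpg"):
        print("allowed; crawl it")
    else:
        print("disallowed; skip it")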
A much closer comparison IMO would be someone compressing a massive library of copyrighted content and then redistributing it and arguing it's legal because "the content has been processed and can't be recovered without a specific setup". I don't think we'd need prior court cases to argue that would most likely be illegal, so I don't see how machine learning models differ.
LAION/StableDiffusion is already legal under the same exemptions as Google Image Search and does respect robots.txt. It was also created in Germany so US court cases wouldn’t apply to it.
Well, you can learn about generative models from MOOCs like the ones taught at UMich, Universitat Tubingen, or New York University (taught by Yann LeCun), and can gain knowledge there.
You can also watch the fast.ai MOOC titled Deep Learning from Scratch to Stable Diffusion [0].
You can also look at open source implementations of text2image models like Dall-E Mini or the works of lucidrains.
I worked on the Dall-E Mini project, and the technical know-how that you need isn't really taught in MOOCs. You need to know, on top of deep learning theory, many tricks, gotchas, workarounds, etc.
You could follow the works of EleutherAI, and follow Boris Dayma (project leader of Dall-E Mini) and Horace He on twitter - and any such people who have significant experience in practical AI and regularly share their tricks. The PyTorch forums are also a good place.
If you're talking about training from scratch and not fine tuning, that won't be cheap or easy to do. You need thousands upon thousands of dollars of GPU compute [1] and a gigantic data set.
I trained something nowhere near the scale of Stable Diffusion on Lambda Labs, and my bill was $14,000.
[1] Assuming you rent GPUs hourly, because buying the hardware outright will be prohibitively expensive.
I have... ~11TBs of free disk space and a 1080ti. Obviously nowhere close to being able to crunch all of Wikimedia Commons, but I'm also not trying to beat Stability AI at their own game. I just want to move the arguments people have about art generators beyond "this is unethical copyright laundering" and "the model is taking reference just like a real human".
To put things in perspective, the dataset it's trained on is ~240TB, and Stability has ~4000 Nvidia A100s (each much faster than a 1080ti). Without those ingredients, you're highly unlikely to get a model that's worth using (it'll produce mostly useless outputs).
That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
As pointed out in [1], it seems machine learning is taking the same path physics already did. In the mid-20th century there was a "break" in physics: before it, individuals were making groundbreaking discoveries in their private/personal labs (think Newton, Maxwell, Curie, Roentgen, Planck, Einstein, and many others); later, huge collaborations (LHC/CERN, IceCube, EHT, et al.) were required, since the machinery, simulations, and models are so complex that groups of people are needed to create, comprehend, and use them.
P.S. To counteract that (unintentionally actually, likely because of a simple optimization of the instruments' duty cycle), in astronomy people came up with the concept of an "observatory" (like Hubble, JWST) instead of an "experiment" (like LHC, HESS telescopes), where outside people can submit their proposals and, if selected, get observational time. Along with the raw data, authors of the proposals get the required expertise from the collaboration to process and analyze that data.
The point is that there is no practical limit on compression. You don't need "AI" or anything besides very basic statistics to get astronomical compression ratios. (See: "zip bomb".)
The only practical limit is the amount of information entropy in the source material, and if you're going to claim that internet pictures are particularly information-dense I'd need some evidence, because I don't believe you.
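To make that concrete, here's a quick sketch showing that even plain zlib gets an "astronomical" ratio on low-entropy input and essentially nothing on random bytes; the bound is the entropy of the source, not the cleverness of the compressor:

    # Sketch: compression ratio is a property of input entropy,
    # not of any intelligence in the compressor.
    import os, zlib

    low_entropy = b"\x00" * (100 * 1024 * 1024)   # 100 MB of zeros
    high_entropy = os.urandom(100 * 1024 * 1024)  # 100 MB of noise

    print(len(zlib.compress(low_entropy)))   # ~100 KB: a ~1000:1 ratio
    print(len(zlib.compress(high_entropy)))  # ~100 MB: essentially no gain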
Correct, however "compression is equivalent to general intelligence" (http://prize.hutter1.net/hfaq.htm#compai) and so in a sense, all learning is compression. In this case, SD applies a level of compression that is so high that the only way it can sustain information from its inputs is by capturing their underlying structure. This is a fundamentally deeper level of understanding than image codecs, which merely capture short-range visual features.
Most human behavior is easy to describe with only a few underlying parameters, but there are outlier behaviors where the number of parameters grows unboundedly.
("AI" hasn't even come close to modeling these outliers.)
Internet pictures squarely falls into the "few underlying parameters" bucket.
Because we made the algorithms and can confirm these theories apply to them.
We can speculate they apply to certain models of slices of human behaviour based on our vague understanding of how we work, but not nearly to the same degree.
Hang on, though: plagiarism is a copyright violation, and that passes through the human brain.
When a human looks at a picture and then creates a duplicate, even from memory, we consider that a copyright violation. But when a human looks at a picture and then paints something in the style of that picture, we don't consider that a copyright violation. However we don't know how the brain does it in either case.
How is this different to Stable Diffusion imitating artists?
Well, that would be ~4000 people each with an Nvidia A100 equivalent, or more people with lesser cards; this would be an open effort, after all. Something similar to folding@home could be used. Obviously the software for that would need to be written, but I don't think the idea is unlikely. The power of the commons shouldn't be underestimated.
It's not super clear whether the training task can be scaled in a manner similar to protein folding. It's a bit trickier to optimise ML workflows across computation nodes because you need more real time aggregation and decision making (for the algorithms).
An A100 costs 10-12k USD for the 40GB/80GB VRAM versions, and it's not even targeted at the individual gamer (it's not effective for gaming) -- they don't even give these things to big YouTube reviewers (LTT). So 4k people will be hard to find. A 3090 you can find; that's a 24GB VRAM card. But that's expensive too, and it's a power guzzler compared to the A100 series.
AFAIK this is not possible at the moment and would need some breakthrough in training algorithms; the required bandwidth between the GPUs is much higher than internet speed.
> That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
The matter is really very nuanced and trivialising it that way is unhelpful.
If I recompress 240TB as super low quality jpgs and manage to zip them up as single file that is significantly smaller than 240TB (because you can), does the fact they are not pixel perfect matches for the original images mean you’re not violating copyright?
If an AI model can generate statistically significantly similar images from the training data, with a trivial guessable prompt (“a picture by xxx” or whatever) then it’s entirely arguable that the model is similarly infringing.
The exact compression algorithm, be it model or jpg or zip is irrelevant to that point.
It's entirely reasonable to say: if this is so good at learning, why don't you train it without the ArtStation dataset?
…because if it’s just learning techniques, generic public domain art should be fine right? Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?
If not, then it’s not just learning technique, it’s copying.
So; tldr: there’s plenty of scope for trying to train a model on an ethically sourced dataset, and investigation of techniques vs copying in generative models.
> If I recompress 240TB as super low quality jpgs and manage to zip them up as single file that is significantly smaller than 240TB (because you can), does the fact they are not pixel perfect matches for the original images mean you’re not violating copyright?
If you compress them down to two or three bytes each, which is what the process effectively does, then yes, I would argue that we stand to lose a LOT as a technological society by enforcing existing copyright laws on IP that has undergone such an extreme transformation.
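(Back-of-envelope, assuming the commonly cited figures of a roughly 4 GB checkpoint and on the order of 2 billion training images: 4e9 bytes / 2e9 images ≈ 2 bytes per image, which is where "two or three bytes each" comes from.)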
Does that mean it’s worthless to try to train an ethical art model?
Is it not helpful to show that you can train a model that can generate art without training it on copyrighted material?
Maybe it’s good. Maybe not. Who cares if people waste their money doing it? Why do you care?
It certainly feels awfully convenient for that there are no ethically trained models because it means no one can say “you should be using these; you have a choice to do the right thing, if you want to”.
I’m not judging; but what I will say is that there’s only one benefit in trying to avoid and discourage people training ethical models:
…and that is the benefit of people currently making and using unethically trained models.
We don't agree on what "ethical" means here, so I don't see a lot of room for discussion until that happens. Why do you care if people waste computing time programming their hardware to study art and create new art based on what it learns? Who is being harmed? More art in the world is a good thing.
> Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?
You couldn't teach a human to do that without them having seen Greg's art. There are elements of stroke, palette, lighting and composition that can't be fully captured by natural language (short of encoding an ML model, which defeats the point).
Copyright says you cannot reproduce, distribute, etc. a work without consent from the author, whatever the means. The copy doesn't need to be exact, only sufficiently close.
However, copyright doesn't prevent someone from looking at the work and studying it, even learning it by heart. Infringement comes only if that someone makes a reproduction of that work. Also, there are provisions for fair use, etc.
> …because if it’s just learning techniques, generic public domain art should be fine right? Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?
Is it fair to hold it to a higher standard than humans though? To some degree it's the whole "xxx..... on a computer!" thing all over again if we go that way
> The matter is really very nuanced and trivialising it that way is unhelpful.
Harping about copyrights in the Age of Diffusion Models is unhelpful (for artists) like protesting against a tsunami. It's time to move up the ladder.
ML engineers have a similar predicament - GPT-3-like models can solve at first try, without specialised training, tasks that took a whole team a few years of work. Who dares still use LSTMs now like it's 2017? Moving up the ladder, learning to prompt and fine-tune ready-made models, is the only solution for ML engineers.
The reckoning is coming for programmers and for writers as well. Even scientific papers can be generated by LLMs now - see the Galactica scandal where some detractors said it will empower people to write fake papers. It also has the best ability to generate appropriate citations.
The conclusion is that we need to give up some of the human-only tasks and hop on the new train.
I think it's a great idea regardless of practicality/implementation, which I think is generally understood to be largely a matter of time, money and hardware. I feel like you should write it up so the idea gets out there, or so you can pitch it to someone if the opportunity arises.
Oh and also I second the fast.ai suggestion; part 2 is 100% focused on implementing stable diffusion from scratch, starting from the Python standard library, and it's amazing all around. The course is still actively coming out, but the first few lessons are freely available already and the rest sounds like it will be made freely available soon.
Can you go into a bit more detail?
What architecture did you use? Is the month training time really just training with mini batches with a constant learning rate? Or are these many failed attempts until you trained a successful model for a few days in the end?
I'm particularly interested in the image generation part (the DDPM/SGM).
Yeah I did have a few false starts. Total time is more like 3 months vs 1 month for the final model. For small scale training I found it’s necessary to use a long lr warmup period, followed by constant lr.
There’s code on my GitHub (glid3)
edit: The architecture is identical to SD except I trained on 256px images with cosine noise schedule instead of linear. Using the cosine schedule makes the unet converge faster but can overfit if overtrained.
edit 2: Just tried it again and my model is also pretty bad at hands actually. It does get lucky once in a while though.
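edit 3: for anyone replicating, the warmup-then-constant schedule is just a LambdaLR multiplier, roughly like this (a sketch; the warmup length and the Linear stand-in for the real UNet are placeholders, not what I actually shipped):

    # Sketch: long linear LR warmup followed by a constant LR.
    import torch

    model = torch.nn.Linear(8, 8)  # stand-in for the actual UNet
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    warmup_steps = 10_000  # "long" is relative to your total step budget

    def lr_lambda(step):
        # Ramp linearly from 0 to 1 over warmup_steps, then hold at 1.
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # Call scheduler.step() once per optimizer step during training.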
What kind of form factor do you use for 4x3090? Don't people usually use the datacenter product line when they're trying to get more than one into a box?
The datacenter cards are 3-4x the price for the same speed + double the vram. Gaming cards are a lot more cost effective if your model fits in under 24gb.
I use an open air rig like the ones used for crypto mining. 4x3090 would normally trip the breakers without mods but if you under volt the cards the power draw is just under the limit for a home AC outlet.
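(Back-of-envelope, assuming a North American 15 A / 120 V circuit: the continuous rating under the 80% rule is about 12 A x 120 V = 1440 W. Four 3090s at the stock ~350 W are already 1400 W before counting the CPU and PSU losses, so undervolting to roughly 280 W per card is what keeps the whole rig under the limit.)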
> Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA.
Doesn't the "BY" part of the license mean you have to provide attribution along with your models' output[0]? I feel you'll have the equivalent of Github Copilot problem: it might be prohibitive to correctly attribute each output, and listing the entire dataset in attribution section won't fly either. And if you don't attribute, your model is no different than Stable Diffusion, Copilot and other hot models/tools: it's still a massive copyright violation and copyright laundering tool.
I feel quite strongly that there is a large difference between Stable Diffusion and Copilot: given the size of the training set vs the number of parameters, it should be very difficult if not impossible for Stable Diffusion to memorize and, by extension, copy-paste to produce its outputs. Copilot is trained on text and outputs text, and coding is also inherently more difficult for an AI model to do, so I expect it memorizes large portions of its input and is copy-pasting in many cases to produce output. I therefore believe Copilot is doing "copyright laundering" but Stable Diffusion is not. Furthermore, I do not believe, for example, that artists should be able to copyright a "style" - but I would like to see them not be negatively impacted by this. It's complicated.
Let me guess that you write more code than visual art?
Isn't it a bit anthropomorphic to compare the two algorithms by "how a human believes they work" instead of "what they're actually doing differently to the inputs to create the outputs"?
These are algorithms and we can look at how they work, so it feels like a cop-out to not do that.
If I was generating image labels I absolutely would need to worry about that. However, since we're only generating images alone, we don't need to worry about bits of the labels getting into the output images.
The attribution requirement would absolutely apply to the model weights themselves, and if I ever get this thing to train at all I plan to have a script that extracts attribution data from the Wikimedia Commons dataset and puts it in the model file. This is cumbersome, but possible. A copyright maximalist might also argue that the prompts you put into the model - or at least ones you've specifically engineered for the particular language the labels use - are derivative works of the original label set and need to be attributed, too. However, that's only a problem for people who want to share text prompts, and the labels themselves probably only have thin copyright[0].
Also, there's a particular feature of art generators that makes the attribution problem potentially tractable: CLIP itself was originally designed to do image classification. Guiding an image diffuser is just a cool hack. This means that we actually have a content ID system baked into our image generator! If you have a list of what images were fed into the CLIP trainer and their image-side outputs[1], then you can feed a generated image back into CLIP and compare the distance in the output space to the original training set and list out the closest examples there.
[0] A US copyright doctrine in which courts have argued that collections of uncopyrightable elements can become copyrightable, but the resulting protection is said to be "thin".
[1] CLIP uses a "dual headed" model architecture, in which both an image and text classifier are co-trained to output data into the same output parameter space. This is what makes art generators work, and it can even do things like "zero-shot classification" where you ask it to classify things it was never trained on.
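Concretely, the lookup I have in mind is just a nearest-neighbor search in CLIP's image embedding space. A sketch, assuming precomputed embeddings (train_embeds) and a parallel attributions list - these names are mine for illustration, not an existing API:

    # Sketch: CLIP's image tower as a content-ID system. train_embeds is an
    # (N, D) tensor precomputed over the training images; attributions is a
    # parallel list of credit lines.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def nearest_training_images(generated_image, train_embeds, attributions, k=5):
        inputs = processor(images=generated_image, return_tensors="pt")
        with torch.no_grad():
            query = model.get_image_features(**inputs)
        # Cosine similarity between the generated image and every training image.
        query = query / query.norm(dim=-1, keepdim=True)
        refs = train_embeds / train_embeds.norm(dim=-1, keepdim=True)
        sims = (refs @ query.T).squeeze(1)
        top = sims.topk(k)
        return [(attributions[i], sims[i].item()) for i in top.indices.tolist()]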
>If I was generating image labels I absolutely would need to worry about that. However, since we're only generating images alone, we don't need to worry about bits of the labels getting into the output images.
Just to be accurate: SD does sometimes generate label text on images, so we do need to worry ;)
This is not possible, because the model is smaller than its training data. Just as any new image it generates is something it made up, any attributions it generated would also be made up.
CLIP can provide “similarity” scores but those are based on an arbitrary definition of “similarity”. Diffusion models don’t make collages.
> Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working.
But training multi-modal text-to-image models is still a _very_ new thing, in terms of the software world. Given that, my experience has been that it's never been easier to get to work on this stuff from the software POV. The hardware is the tricky bit (and preventing bandwidth issues on distributed systems).
That isn't to say that there isn't code out there for training. Just that you're going to run into issues and learning how to solve those issues as you encounter them is going to be a highly valuable skill soon.
edit:
I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers [0] mean that the only models that perform much of anything at all need a lot of parameters. The bigger the better - as far as we can tell.
Very simply - researchers start by making a model big enough to fill a single GPU. Then, they replicate the model across hundreds/thousands of GPU's, but feed each on a different set of the data. Model updates are then synchronized, hopefully taking advantage of some sort of pipelining to avoid bottlenecks. This is referred to as data-parallel.
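In code, the skeleton looks something like this (a sketch of plain PyTorch DDP; a real launcher such as torchrun sets the rendezvous env vars and spawns one process per GPU, and the model/loss here are placeholders):

    # Sketch: data-parallel training. Each process owns one GPU and a
    # disjoint shard of the data; gradients are all-reduced during backward.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def train(rank, world_size, dataset):
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        model = torch.nn.Linear(512, 512).to(rank)  # stand-in for the real model
        model = DDP(model, device_ids=[rank])
        # DistributedSampler gives each rank a different slice of the data.
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for batch in loader:
            loss = model(batch.to(rank)).pow(2).mean()  # placeholder loss
            loss.backward()  # gradient sync across ranks happens here
            opt.step()
            opt.zero_grad()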
All this horsepower deployed to image generation is interesting, but somebody wake me up when there is a stable diffusion for SQL, or when on-demand generative user interfaces are spun up on the fly to suit the purpose.
It will be worthwhile to use images from Commons. I have found that my photography is used in the Stable Diffusion dataset. What was funny is that they had taken the images from URLs other than my flickr account.
I don’t know anybody that is blown away by keyboard auto suggest. It’s wrong as often as it is right. Not saying it isn’t useful, but let’s not oversell it.
Lol. Especially the AI version of keyboard auto suggest.
Let's take a deterministic algorithm that predictably corrects your typos and rebuild it on AI. It will offer you no benefits, and it will completely destroy the utility, since it will never work predictably or accurately.
My comment would remain exactly the same for auto-correct. They are essentially the same thing, just pre and post typing.
They both serve the same purpose of helping the user quickly and accurately communicate on a cell phone. Like auto-suggest, I rely on auto-correct to fix things that I know I commonly mistype. When it doesn't work predictably, it's useless.
Honestly I was quite surprised at how regular people are impressed by this tech. I was also surprised by how little regular people are aware of this tech even existing.
We, on hackernews, on a thread about Stable Diffusion, are of course not too unimpressed.
This is universally the result that I see from my nontechnical friends; Apple has literally all the money and Siri has the listening comprehension of a drunk beagle.
Maybe blown away at how terrible it is... like how many times do we need to correct it for it to show us the same shitty suggestions. I'm not sure I'd even notice if it were turned off.
I'm in kind of the same boat. I think indie games are the way to show the true potential of SD.
Hence, I'm working on http://diffudle.com/ which is a mix of Wheel of Fortune + Stable Diffusion + Wordle. I can't figure it out, but it feels to me like it's lacking something.
> Hence, I'm working on http://diffudle.com/ which is a mix of Wheel of Fortune + Stable Diffusion + Wordle. I can't figure it out, but it feels to me like it's lacking something.
Very creative and a fun way to interact with SD. I would encourage you to explore this idea further, as interest in SD might grow and people will want to engage with the topic in an accessible way. I like the idea of hard-limiting play (1 quiz per day), but a small backlog of previous pictures could be nice to explore a little.
Wow, that's really cool actually, have to bookmark it! One feature I would add would be going back through the previously shown images, that would make it easier to guess what they have in common. Also, larger images would look nicer, but I guess that would drive the costs up?
Thanks!! I'm debating whether to show a history of images, as it will reduce the difficulty by a lot. Larger images are a great suggestion; I'll add them ASAP.
It confused me that the letter boxes were divided 7+3, so I thought it would be two words, while the correct answer was a single 10-letter word. Maybe try to avoid wrapping words.
Nice observation!! I'm thinking that including a start and end mark would improve the UX. I can't avoid wrapping, as the prompt might be very long.
There just aren't a lot of market opportunities where being right 99% of the time is good enough. If you are operating at scale and 1/100 decisions are wrong, the outcome is poor and often highly off-putting to users.
It’s possible this time is different, but people at my company were entertained by DALLE for all of 5 minutes before no one ever mentioned it again. The value proposition is simply low.
Are you kidding? Many times corporate decisions are being made effectively at random. Thinking that the average company operates with a .999 batting average is a total fantasy.
When our c suite decides on an ad campaign and tells our artists to draw normal humans, those people have 3 legs or upside down teeth exactly 0% of the time. Humans have many many limitations, but with every model I’ve tested there’s a set of errors that would virtually never be made by any human.
I think it's interesting that drawing too many fingers is a mistake kids make too, although with less photorealism otherwise. I guess there's a reason all those famous artists drew hundreds of hands as practice as well.
As an outsider, this rings true to me. I still don’t see any reduction of hours involved in producing professional level works. Generating YouTube thumbnails, sure.
I think this analogy doesn't hold water - horses aren't exactly a beacon of reliability (having owned one).
I've already seen tools that support workflows where you compose art by iteratively generating a piece of it, performing some correction, and repeating. So, I think there's room in the art world for less than perfectly generated art. That said, let's not kid ourselves that the typical failure modality of ML today (99% correct enough, 1% disastrously incorrect) doesn't either cause it to be entirely useless in many applications or end up wreaking havoc on end users in others.
It's only an analogy, but it serves to underscore the last point you make. Initial versions of the technology can make some genuine horrors but you're blinding yourself to progress if you can't see the potential in it.
It started as an AI-powered MS paint for my son. But after demoing it to a few coworkers, it morphed into a bit more than that. Now it’s more of a storybook creator that young kids can use to generate their own stories.
Not looking to monetize at all. But inference is expensive. So might have something to cover costs.
Some backstory:
When I was growing up in the early 90s, my dad took me into his office over the weekends when he was doing some overtime paperwork. I would be on his IBM Windows 3.1 workstation. He didn’t have any games on his work computer, so I would spend the entire day “playing” with MS Paint. I couldn’t read yet (3-4 years old), but I was able to figure it out.
We didn’t have a computer at home. But seeing how I was so good at it, my parents bought one. I eventually got into coding etc. All of this defined who I am today.
So I wanted to recreate some of this magic, for my own son. He’s 3 months old, so not quite the right age. But I have some free time on parental leave. So why not. Might be useful for parents with 3-5 year olds.
>Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community. Edit out swipes.
> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
AI mobile keyboard suggestions aren't really so good that I'd be impressed by them, especially because they get it wrong so many times. Although, to be fair, I do write in multiple languages, which I'm quite sure doesn't help the AI in the slightest.
Last time, we had people in the field of AI writing on HN expecting another AI winter. I think all the improvements we have now, and the ones we might get in the future (like Stable Diffusion 3.0 and possibly many others), while not L5 self-driving cars or general AI, have yet to be fully used outside the tech community, or even within tech itself. The long tail of the distribution and the positive feedback loop will sustain its development for another 10 years.
I guess you need to look at the current TAM for visual production in general.
That's the baseline (so includes visualization studios, agencies, game studios etc). Generally this potentially can help in many labour intensive parts of a creative visual process.
Whether that is a "large" organization market or not depends on your metric and on the market positioning of the offering. I would see applications in both specialist content creation tools as well as "stock photos and merch".
In terms of finding stock photos, if you add a better text API that is easier to control, this can probably compete with static stock photos, in the sense that people can tune their images as much as they like, for example with their corporate merch (imagine producing a slide set at Acme Co.: "Please give me an elephant and a walrus wearing Acme caps").
Ad agencies already love that they can train a model to quickly iterate product shot ideas extremely rapidly.
Then we have "the usual" effect automation has on market demand: automation increases the productivity of a task requiring labour, hence allowing the cost of a unit of production to fall, which generally increases demand. I.e. creative stuff will be cheaper to do; you won't replace artists, but suddenly the dude or dudette who spent hours just tweaking stuff has their own art studio at their fingertips to command. They can get so much more done, much faster.
The tech is not 100% bulletproof yet, but at this pace it will be good enough soon (or probably already is for several applications, if there were just some UX sugaring targeting a specific domain workflow).
Does it actually make anything in the corporate world better to use generated images in slides? When coworkers use stock photos which were presumably made by humans operating actual cameras, I don't think it's clear that their presentation is actually more valuable as a result.
I suspect those applications will come from specializing the model. For example, there are people building avatar generators or automated ad creatives. A cool application I've been toying with is generating icons.
I think it's looking fairly similar, the first one was a bit tricky too. Later improvements by the community made it clearer.
The docs aren't good though: they tell you to download two things when actually I think you only need one, and if you do need two, they don't tell you at all where to put the second.
You really need xformers if you're doing it at home; I've got a 3090 and it blew through the RAM without it. However, the instructions didn't work for me for compiling, and there's an incompatibility if you try to install from conda. You can make it work, but you need to upgrade python from 3.8.5 to 3.9 in the yaml file first, then you can install it (xformers needs 3.9+, and something else in SD breaks on 3.10+, so 3.9 works).
This needs the classic "sit next to a new person installing it by following the docs and see what problems they hit, fix the docs and start from scratch again" process.
Looks good, though so far the images I've made don't look as nice as with 1.4, but I guess that's largely down to finding the right tweaks for the model and right magic wording for the prompts.
Is it? For me, I'm in tech but nowhere near anything for which playing with a new SD release would be relevant to my day job. Having a couple of extra days off to play with a new tech toy probably means I'll use it more.
I wonder why these AI repos' documentation is so bad compared to what we are used to in general. Where are the intro, getting started, examples (commands), config docs, etc.?
I believe it's a combination of the fact that most of these models are basically 'research dumps' primarily targeting other researchers and given this they are assuming a level of familiarity with related tools/libraries. So it's up to interested people in the community to take it the last mile/block/whatever to make it easy to use, address specific use cases etc. for use by a less academic/technical audience.
In addition to removing NSFW images from the training set, this 2.0 release apparently also removed commercial artist styles and celebrities [1]. While it should be possible to fine tune this model to create them anyway using DreamBooth or a similar approach, they clearly went for the safe route after taking some heat.
I predicted, back when they started backpedaling, that there's a chance sd1.4 or 1.5 will be the best model available to the general public for a very long time, because the backlash will force them to castrate their own models.
You can see that nobody likes this new model in any of the stable diffusion communities. It's a big flop, and for a good reason: it was so successful in the first place because you could combine artist names to get the model to the outcome you want.
I'll again remind anyone who thinks they might want to use this to download a working version of SD now. They might break their own libraries in the future, and getting SD1.4 could be a real hassle in a year or so. Getting the right .ckpt file, which can have pickled python malware, is not so trivial, and this will get worse in time.
It's going to diverge into a castrated official model that intentionally breaks the older models, and older models from unofficial shady sources that might contain malware.
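If you do end up loading a .ckpt from a shady source, the standard mitigation is the restricted-Unpickler pattern from the Python docs. A conceptual sketch (real PyTorch checkpoints are zip archives wrapping a pickle, so in practice you'd hook this into torch's loading or use a dedicated scanner, and my whitelist is illustrative, not complete):

    # Sketch: pickle can execute arbitrary code via its GLOBAL opcode, which
    # is why .ckpt files can carry malware. A restricted Unpickler whitelists
    # what may be constructed and refuses everything else.
    import io
    import pickle

    ALLOWED = {
        ("collections", "OrderedDict"),
        ("torch._utils", "_rebuild_tensor_v2"),
    }

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            if (module, name) in ALLOWED:
                return super().find_class(module, name)
            raise pickle.UnpicklingError(f"blocked: {module}.{name}")

    def safe_loads(data: bytes):
        return RestrictedUnpickler(io.BytesIO(data)).load()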
These models will always work best with open datasets and open platforms for this reason.
Social media/"AI ethics" pressure will eventually come for these organizations (see Meta's recent debacle with Galactica). Being an unknown org without these pressures was a big reason Stable Diffusion got so popular in the first place.
That's like saying that obtaining On the Origin of Species or the Linux kernel will be harder in the future. If anything, the SD weights will become increasingly ubiquitous as they start being embedded into consumer electronics.
As someone completely unfamiliar with SD but interested in playing around with it in the future, what exactly should I download, to have a fully local instance of 1.4 or 1.5?
I'm not comparing with the others because I don't have experience with them, but https://invoke-ai.github.io/InvokeAI/ is great, with an easy install and active development.
Mixing artist names was by far the most effective way to create aesthetically pleasing images, this is a huge change. DreamBooth can only fine-tune on a couple dozen images, and you can't train multiple new concepts in one model, but maybe someone will do a regular fine-tune or train a new model.
That really depends on whether you mean "like artist X" when you say "aesthetically pleasing". I was fooling around with furry diffusion and got to try a few different models. Yiffy understood artist names, and furry did not: it had further training, but stripped of artist tags.
All these models are pretty good, as that community is strong on art, styles, art skill, and tagging, making the models a serious test case for what's possible. The model with artist names was indeed capable of invoking their styles (for instance, an artist with exceptional anatomy rendering had that translate into the AI version). The more-trained model without the artist names was much more intelligent: it was simply more capable of quality output, so long as your intention wasn't "remind me of this artist".
I think that's likely to be true in the general case, too. This tech is destined for artist/writer/creator enhancement, so it needs to get smarter at divining INTENT, not just blindly generating 'knock-offs' with little guidance.
What you want is better tagging in the dataset, and more personalized. If I have a particular notion of an 'angry sky', this tech should be able to deliver that unfailingly, in any context I like. Greg Rutkowski not required or invoked :)
I'd be curious how well the model still performs given such prompts. Disparate concepts, interpolation, n' all that. Surely it performs worse - but I bet it gets closer than you might think.
Removing NSFW content is fine, people who care about that can work around it easily. Removing celebrities and commercial artists was a mistake though and I expect this will need to be really impressive in other ways or people aren't going to bother using it.
It's remarkable, this sense of entitlement people have. You literally have a computer program here that can make photorealistic imagery of almost ANYTHING you ask it to, which was impossible even half a year ago, and here you are complaining that people won't use it unless it incorporates all of the protected imagery of famous artists and celebrities. Amazing.
I don't think you're applying the principle of charity here ("make the best interpretation of a post"). The person isn't complaining; they're just saying that people will probably return to v1 unless v2 has something impressive to compensate.
“Prompting” and “keywords” are not an essential part of this technology. If you like tokens, make your own tokens with textual-inversion or image inputs.
Seems the structure of the UNet hasn't changed, other than the text encoder input (768 to 1024). The biggest change is the text encoder, switched from ViT-L/14 to ViT-H/14 and fine-tuned based on https://arxiv.org/pdf/2109.01903.pdf.
Seems the 768-v model, if used properly, can substantially speed up generation, but I'm not exactly sure yet. Switching my app to the 512-base model next week seems straightforward.
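For anyone who wants to verify the encoder change locally, the widths are visible from the model configs. A minimal sketch, assuming the diffusers library and the public "stabilityai/stable-diffusion-2" checkpoint (both are my assumptions, not from the post above):

```python
# Inspect the text-encoder / UNet interface of SD 2.0 (hedged sketch).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
print(type(pipe.text_encoder).__name__)       # CLIP-style text model class
print(pipe.text_encoder.config.hidden_size)   # expect 1024 for v2 (768 for v1)
print(pipe.unet.config.cross_attention_dim)   # the UNet side of the same interface
```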
I'm disappointed they didn't push parameter count higher, but I suppose they want to maintain the ability to run on older/lower end consumer GPUs. Unfortunately it severely limits how high-quality the output can be.
They're motivating that choice with this paper: https://arxiv.org/pdf/2203.15556.pdf The paper shows that you can get better performance than GPT-3 from a much smaller model if you bump up the training time and training data by roughly 4x.
Larger models are still much better. Google's parti model can do text perfectly and follows prompts way more accurately than Stable Diffusion. It's 20B parameters and with the latest int8 optimizations it should be possible to get that running on a consumer 24GB card in theory.
I think they're looking into larger models later, though.
Can't forget the time it takes to run inference, even on the latest A100/H100. Generating in under, say, ten seconds enables more use cases (and so on, until high-fps video is possible).
A built-in 4x upscaler: "Combined with our text-to-image models, Stable Diffusion 2.0 can now generate images with resolutions of 2048x2048–or even higher."
Depth-to-Image Diffusion Model: "infers the depth of an input image, and then generates new images using both the text and depth information." Depth-to-Image can offer all sorts of new creative applications, delivering transformations that look radically different from the original but which still preserve the coherence and depth of that image (see the demo gif if you haven't looked)
Better inpainting model
Trained with a stronger NSFW filter on training data.
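For the curious, the 4x upscaler ships as its own checkpoint. A minimal sketch of driving it, assuming the diffusers library and the "stabilityai/stable-diffusion-x4-upscaler" weights (the names and API details are my assumptions and may have shifted since release):

```python
# Hedged sketch: 4x upscaling a 512px render to ~2048px with the SD 2.0 upscaler.
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("render_512.png").convert("RGB")  # hypothetical input file
result = pipe(prompt="a red barn among redwood trees", image=low_res).images[0]
result.save("render_2048.png")
```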
For me the depth-to-image model is a huge highlight and something I wasn't expecting. The NSFW filter is a non-event (it's trivially easy to fine-tune the model on porn if you want, and porn collections are surprisingly easy to come by...).
The higher-resolution features are interesting. HuggingFace has gotten the 1.x models working for inference in under 1GB of VRAM, and if those optimizations can be preserved, it opens up a bunch of interesting possibilities.
> it's trivially easy to fine-tune the model on porn if you want, and porn collections are surprisingly easy to come by
Not really surprised they did this, but rest assured some communities will have it fine-tuned on porn now-ish. So probably they did it for legal reasons, in case illegal material is generated, since they are real companies/people with their names on the release?
I looked into it (though didn't download the models - too dodgy). One of the NSFW models that's gained traction, and gained attention because it seems to be better at generating even non-porn faces and bodies, is called "Hassan's blend". Hassan mentioned that he'd taken down an earlier checkpoint because it generated undesirable images.
Reading between the lines, it likely generated CSAM-like images even without explicit prompting for it.
To put things in perspective, the dataset it's trained on is ~240TB, and Stability has over 4,000 Nvidia A100s (each much faster than a 1080 Ti). Without those ingredients, you're highly unlikely to get a model that's worth using (it'll produce mostly useless outputs).
That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".
It compresses a whole image down to about 1 byte, a 60,000:1 ratio. That's how much it's allowed to "memorise" from each input on average: less than a single pixel of a whole image.
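To make the back-of-the-envelope math explicit, here is a sketch; the 240TB and few-GB figures come from the comments above, while the ~4 billion image count is my own assumption about a LAION-scale dataset:

```python
# Rough bytes-of-weights-per-training-image estimate (all inputs approximate).
dataset_bytes = 240e12   # ~240 TB of training images (from the thread)
model_bytes   = 4e9      # checkpoint of a few gigabytes (from the thread)
num_images    = 4e9      # assumed LAION-scale image count

avg_image_bytes  = dataset_bytes / num_images   # ~60,000 bytes per input image
weight_per_image = model_bytes / num_images     # ~1 byte retained per image
print(avg_image_bytes / weight_per_image)       # ~60,000:1 ratio
```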
Depends on your wallet, but the RTX 3080 or RTX 3060 are good graphics cards for creating these images. If you just want to dip your toes in without spending much, you can use Google Colab and rent Google's graphics cards either for free or for $10. Here's a link to a Google Colab that you can just run for free and that is used a lot: https://colab.research.google.com/github/TheLastBen/fast-sta...
P.S. If you want to buy a graphics card, make sure it has at least 12GB of VRAM.
You're probably better off using Real-ESRGAN: https://github.com/xinntao/Real-ESRGAN. It's pretty solid and fast, and even has portable executables you can use as-is. The upscaler that comes with Stable Diffusion might work for you, but I suspect it'll do a better job upscaling Stable Diffusion output than natural images (might be wrong though).
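If you go the repo route rather than the portable executables, the inference script is the usual entry point. A hedged sketch; the flag names follow my reading of the repo's README and may have changed:

```python
# Call Real-ESRGAN's bundled inference script (flags are assumptions; see README).
import subprocess

subprocess.run([
    "python", "inference_realesrgan.py",
    "-n", "RealESRGAN_x4plus",   # pretrained general-purpose 4x model
    "-i", "old_photo.jpg",       # hypothetical input image
    "-o", "results",             # output directory
], check=True)
```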
I tried this on one of my home images. I have a nice Canon Pro 100 printer that can print 13"x19" pictures, and my camera is a 20-megapixel Panasonic GH5. The printer can print much higher resolution than my camera. So I took one of my photos and processed it with Real-ESRGAN to double the resolution (in each direction, so 4x the pixels). The photo is a red barn with redwood trees behind it. It did well increasing the resolution of the barn, making it look more crisp and bright. But there is an area with some trees in shadow behind the barn, and it lost detail there.
Anyway, I think it would be fun to play with; it just depends on the content of the image and the artist's preferences. I still haven't printed a full page of the upscaled photo, but I do want to try that and see how it looks in comparison!
I dunno - I've found it useful on a bunch of images[1] but I tend to try Pixelmator Pro first because that's a simple key combination to enlarge an image and 90% of the time it's Good Enough for my purposes.
[1] The new Photo AI, on the other hand, is slow, clunky, and not infrequently glitches out wildly. But on the plus side it does combine sharpening and denoising into one workflow.
I was super unimpressed with the 1.0 release of Photo AI; in particular, the sharpening was a LOT slower than standalone. But that's fixed now, and unless Topaz starts backporting the improved models to the standalone tools -- so far, they have not -- Photo AI will get you better results.
I spent a lot of time last month using Gigapixel (actually the improved version in their new Photo AI product) on dozens of images for my dad's memoir. There were a couple of failures where the input image was just so blurry or low-res that it couldn't be saved, but Topaz significantly improved image quality while upscaling in 90+% of cases.
Note that on some image types it tends to make things look digitally painted rather than detailed. I recommend you try a few different tools and see what works best for the type of photography you do.
They know they are going to be the next target in the war on general purpose computing. They're trying to stave it off for as long as possible by signalling to the authorities that they are the good guys.
A confrontation is inevitable, though. Right now it costs moderate sums of money to do this level of training. Not always will this be so. If I were an AI-centric organization, I would be racing to position myself as a trustworthy actor in my particular corner of the AI space so that when legislators start asking questions about the explosion of bad actors, I can engage in a little bit of regulatory capture, and have the legislators legislate whatever regulations I've already implemented, to the disadvantage of my competitors.
For people who say "people can make whatever images they like in photoshop," I will remind you of this:
https://i.imgur.com/5DJrd.jpg
This is business, not ethics, though. They just don't want the negative attention, that's it. And because this is a business, time matters, same as with DRM. Almost all DRM gets defeated, yet it works, because it hinders the crackers, even if only for a while. Same here. While Stability is not under attack yet, they can establish themselves as the household name for AI in a safer context.
Banknote printing is primarily protected against on the hardware level of printers, no? With the nigh-invisible unique watermark left by every printer, there’s virtually no way you’d get away with it. My guess is that the Photoshop filter exists mostly as a barrier against the crime of convenience.
My point is that there is precedent for governments requiring companies to implement restrictions on what images can be handled by their software.
As I explained: This kind of mandated restriction is looming over AI. Companies are trying to get out in front of these restrictions so they can implement them on their own terms.
>My point is that there is precedent for governments requiring companies to implement restrictions on what images can be handled by their software
But images of boobs are still legal. So this NSFW filter goes well beyond what the law asks. Is the issue that, even if you don't train on CP, the model might output something that some random person gets offended by and labels as CP? I assume other companies can focus on NSFW and have their lawyers figure this out. IMO it would be cool if someone sued the governments and made them reveal the facts behind the claim that fake or cartoon CP is dangerous; I think they could focus on saving real children rather than cartoon ones.
You kill way too many birds with such a stone. Of course you could never do any kind of photorealistic game in real time if you had to pre-screen everything with an actually effective censor.
Indeed, what they're already doing is already hobbling the models.
Emad is right that we learn new things from the creativity unleashed by accessible models that can be run (and even fine-tuned) on consumer hardware.
But judging from what people post, one thing we learn is that it seems models fine tuned on porn (such as the notorious f222 and its derivative Hassan's blend) can be quite a bit better at non-porn generation of diverse, photorealistic faces and hands too.
> Of course you could never do any kind of photorealistic game in real time if you had to pre-screen everything with an actually effective censor.
I'm not sure I understand this. A possible implementation could be a neural net that blanked the screen with a frown face any time it detected something it thinks was "bad". What purpose/need would pre-screening serve?
What you describe IS pre-screening. And it's not workable, because it would take a ton of dedicated resources to run in real time, and even then it would be disastrous for latency: unworkable for most games and even desktop applications.
I think this is making the assumption that all frames are blocked.
> then it would be disastrous for latency
We're talking about the future here. I'm not sure it makes sense to use current tech to say it's not going to happen, or come up with latency numbers. But, "real time" inference is definitely a possibility, and is in active use for video moderation (Youtube, etc) and object detection (Tesla, etc). Nobody will notice a system running at 2000fps.
Very niche example here, that can be easily circumvented
Also seems problematic to approach this from a purely capitalistic and consumerist angle. There is a lot of opportunity here besides just launching the next AI unicorn.
I am not clicking that link because no one should take the risk of you proving your point of what horrors could pop out of one of these models.
I will say that while the government backlash is inevitable just like it was with encryption, these image generation models are so easy to train on consumer hardware that the cat is hopelessly out of the bag. It might as well be thoughtcrime.
Or it's an output of "blank adobe photoshop with dialog refusing to edit bank note, full screen, windows vista, 4k, artstation, greg rutkowski, dramatic lighting".
In practice, it's unclear how well avoiding training on NSFW images will work: the original LAION-400M dataset used for both SD versions did filter out some of the NSFW stuff, and it appears SD 2.0 filters out a bit more. The use of OpenCLIP in SD 2.0 may also prevent some leakage of NSFW textual concepts compared to OpenAI's CLIP.
It will, however, definitely not affect the more-common use case of anime women with very large breasts. And people will be able to finetune SD 2.0 on NSFW images anyways.
The main reason why Stable Diffusion is worried about NSFW is that people will use it to generate disgusting amounts of CSAM. If LAION-5B or OpenAI's CLIP have ever seen CSAM - and given how these datasets are literally just scraped off the Internet, they have - then they're technically distributing it. Imagine the "AI is just copying bits of other people's art" argument, except instead of statutory damages of up to $150,000 per infringement, we're talking about time in pound-me-in-the-ass prison.
At least if people have to finetune the model on that shit, then you can argue that it's not your fault because someone had to do extra steps to put stuff in there.
> If LAION-5B or OpenAI's CLIP have ever seen CSAM
Diffusion models don't need any CSAM in the training dataset to generate CSAM. All they need is random NSFW content alongside safe content that includes children.
So I definitely see an issue with Stable Diffusion synthesizing CP in response to innocuous queries (in terms of optics; the actual harm this would cause is unclear).
That said, part of the problem with the general ignorance about machine learning and how it works is that there will be totally unreasonable demands for technical solutions to social problems. “Just make it impossible to generate CP” I’m sure will succeed just as effectively as “just make it impossible to Google for CP.”
It sometimes generates such content accidentally, yes. Seems to happen more often whenever beaches are involved in the prompt. I just delete them along with thousands of other images that aren't what I wanted. Does that cause anyone harm? I don't think so...
> I’m sure will succeed just as effectively as “just make it impossible to Google for CP.”
So... very, very well? I obviously don't have numbers, but I imagine CSAM would be a lot more popular if Google did nothing to try to hide it in search results.
I remember Louis CK made a joke about this, in regard to pedophiles (who are also rapists): what are we doing to prevent this? Is anyone making very realistic sex dolls that look like children? "Ew, no, that's creepy." Well, I guess you would rather they fuck your children instead.
It's one of those issues you have to be careful not to get too close to, because you get accused by proximity; if you suggest something like what I said above, people might think you're a pedophile. So, in that way, nobody wants to do anything about it.
The underlying idea you have is that the artificial CSAM is a viable substitute good - i.e. that pedophiles will use that instead of actually offending and hurting children. This isn't borne out by the scientific evidence; instead of dissuading pedophiles from offending it just trains them to offend more.
This is the opposite of what we thought we learned from the debate about violent video games, where we said things like "video games don't turn people violent because people can tell fiction from reality". This was the wrong lesson. People confuse the two all the time; it's actually a huge problem in criminal justice. CSI taught juries to expect infallible forensic sci-fi tech, Perry Mason taught juries to expect dramatic confessions, etc. In fact, they literally call it the Perry Mason effect.
The reason why video games don't turn people violent is because video game violence maps poorly onto the real thing. When I break someone's spine in Mortal Kombat, I input a button combination and get a dramatic, slow-motion X-ray view of every god damned bone in my opponent's back breaking. When I shoot someone in Call of Duty, I pull my controller's trigger and get a satisfyingly bassy gun sound and a well-choreographed death animation out of my opponent. In real life, you can't do any of that by just pressing a few buttons, and violence isn't nearly that sexy.
You know what is that sexy in real life? Sex. Specifically, the whole point of porn is to, well, simulate sex. You absolutely do feel the same feelings consuming porn as you do actually engaging in sex. This is why therapists who work with actual pedophiles tell them to avoid fantasizing about offending, rather than to find CSAM as a substitute.
>The reason why video games don't turn people violent is because video game violence maps poorly onto the real thing
I don't believe this is the reason. Practicing martial arts, which maps well onto real-life violence, doesn't produce an increase in violent behaviour that I can see. Similarly, playing FPS games in VR, which maps much more closely than flat-screen games, does not make me want to go shoot people in real life. I don't think people playing paintball or airsoft will turn violent from partaking in those activities. The majority of people are just normal people, not bad people who would ever shoot or rape someone.
>You know what is that sexy in real life? Sex.
Why is any porn legal, then? If porn turned everyone into sexual abusers, I would believe your argument, but that just isn't true. And even if it were true that a small percentage of people who see porn turn into sexual abusers, I don't think that makes it worth banning porn altogether. I feel there should be a better way that doesn't restrict people's freedom of speech.
"Artificially-generated CSAM" is a misnomer, since it involves no actual sexual abuse. It's "simulated child pornography", a category that would include for example paintings.
Hmm, that’s a good point. It seems to be able to “transfer knowledge” for lack of a better term, so maybe it wouldn’t need to be in the dataset at all…
I have no answer to this but I have seen people mention that artificial CSAM is illegal in the USA, so the question of whether it is better or not is somewhat overshadowed by the very large market where it is illegal.
They’ve ensured the only way to create CSAM is through old-fashioned child exploitation, meanwhile all perfectly humane art and photography is at risk of AI replacement.
This is a huge missed opportunity to actually help society.
Stable diffusion is able to draw images of bears wearing spacesuits and penguins playing golf. I don't think it actually needs that kind of input to generate it. It's clearly able to generalize outside of the training set. So... Seems it should be possible to generate that kind of data without people being harmed.
That being said, this is a question for sociologists/psychologists IMO. Would giving people with these kinds of tendencies that kind of material make them more or less likely to cause harm? Is there a way to answer that question without harming anybody?
Without the changes they made to Stable Diffusion, it was already able to generate CP. That's why they restricted it from doing so. It did not have child pornography in the training set, but it did have plenty of normal adult nudity, adult pornography, and plenty of fully clothed children, and was able to extrapolate.
Anyway, one obvious application: FBI could run a darknet honeypot site selling AI-generated child porn. Eliminate the actual problem without endangering children.
This isn't the case in law in many countries. Whether an image is illegal or not does not solely depend on the means of production; if the images are realistic, then they are often illegal.
Don't forget that pornographic images and videos featuring children may be used for grooming purposes, socializing children into the idea of sexual abuse. There's a legitimate social purpose in limiting their production.
Once I read an article about a guy who got arrested because he’d put child porn on his Dropbox. I had assumed he’d been caught by some more sophisticated means and that was just the public story. I’m amazed that anyone would be stupid enough to distribute CSAM through an account linked to their own name.
So your hypothesis is that if the FBI gives the database to a company it will inevitably leak to the pedophile underworld?
I can't judge how likely that is.
I guess I also don't care much, as I only really care about stopping production using real children; simulated CSAM gets a shrug, and even use of old CSAM only gets a frown.
What company? How is it that people are advocating for the release of this database yet nobody says to whom?
My (lol now flagged) opinion is that it’s kind of weird to advocate for the CSAM archive to move into [literally any private company?] to turn it into some sort of public good based on… frowns?
I regularly skimmed 4Chan’s /b/ to get a frame of reference for fringe internet culture. But I’ve had to stop because the CSAM they generate by the hundreds per hour is just freakishly and horrifyingly high fidelity.
There’s a lot of important social questions to ask about the future of pornography, but I’m sure not going to be the one to touch that with a thousand foot pole.
I've spent too many hours there myself, but I haven't seen any AI CSAM, and it's been many years since I witnessed trolls posting the real thing. Moderation (or maybe automated systems) got a lot better at catching that.
Now, if you meant gross cartoons, yes, those get posted daily. But there are no children being abused by the creation or sharing of those images, and conflating the two types of image is dishonest.
This comment is so far off it might as well be an outright lie. There hasn't been CSAM on /b/ for years. The 4chan you speak of hasn't existed in a decade.
What is the point of making it "as hard as possible" for people?
This is not a game release. It doesn't matter if it's cracked tomorrow or in a year. It's open source, no less; it's going to happen sooner rather than later.
As disgusting as it is, somebody is going to feed CP to an AI model, and that's just the reality of it. It's going to happen one way or another, and it's not any of these AI companies' fault.
Plausible deniability for governments. It's like DRM for Netflix-style streaming platforms: if they didn't add DRM and their content owners' content got pirated, it could be argued in court that Netflix didn't do everything in its power to stop such piracy. So too here for Stability AI; they've said this is their reasoning before.
They don't. The training dataset, though, may have been obtained through human rights violations. The problem is when the novelty starts to wear off: then they will start to look for fresh training data, which may again incur more human rights violations. If you can ensure that no new training data are obtained that way, then I guess it's okay? (Personally, I don't condone it.)
Once again this does pose an interesting problem, though. The AI people claim no copyright issues with the generated images because AI is different and the training data is not simply recreated. This would also imply that a model released by a paedophile, generated from illegal material, would itself not be illegal, as the illegal data is not represented within the model.
I very much doubt the police will look at AI this way when such models do eventually hit the web (assuming they haven't already) but at some point someone will get caught through this stuff and the arrest itself may have damning consequences throughout the AI space.
They’re already out there, although they’re hard to find via Google - people are doing wild things like “merging” hentai models with models trained on real life porn to get realistic poses and lighting with impossible anatomy.
The scary thing is that you can then train it further with things like DreamBooth to start producing porn of celebrities… or, even more worrying, people you know.
Seriously folks, we are within a year or less of this being trivial. It’s already possible with a lot of work today.
I have no idea how it works but I have seen people talking about models trained to draw furry art. And I assume no one spent the millions on AWS to train a full model from scratch.
I believe what they do is take the released version of stable diffusion and then continue training from there with their own image sets. I came across their attempts when looking into how to train the model based on some images of my own; their data set so far reaches between tens of thousands and hundreds of thousands images.
All the difficult parts (poses, backgrounds, art styles) have already been done by the SD researchers; the porn network only needs reference material for the NSFW descriptions/tags/details. This is significantly cheaper.
A similar project, training SD to output images in the style of Arcane, is incredibly successful in replicating the animation style with what seems to be very little actual training data.
I don't think you need to start from scratch at all if you use the SD model as a base, all you need to do is to train it on specific concepts, styles and key words that the original doesn't have.
Porn has driven many tech advances. I predict that models trained on specific porn genres will appear as soon as training a good model is doable for under $5000. They’ll get here much quicker if we get video to that mark first.
Yes and no. Yeah, the VHS vs. Beta situation was exaggerated, but you'd be surprised how many of Netflix's and YouTube's UI tricks were taken from innovations made on adult sites.
I'd even say the push for high bandwidth to the public was strongly related to that. Take HTML5 video players: adult websites were faster to implement them than the big streaming sites, which still used Flash or similar tech.
Even if porn is what you want to generate, it's not clear a porn-only dataset is what you want. It can probably generate better porn if it has a little context about what, say, a bedroom is.
What's more interesting, is that there's evidence (from public posts, I haven't tried these models myself) that models trained on some porn get better at non-porn images too.
No. The whole point of these models is that they combine information across domains to be able to create new images. If you trained something just on, say baseball, you could only generate the normal things that happen in baseball. If you wanted to generate a picture of a bear surfing around the bases after hitting a home run, you'd need a model that also had bears and surfing in the training data, and enough other stuff to understand the relationships involved in positioning everything and changing poses.
That's a really interesting point, and it makes me realize that the Nancy Reagan 'what constitutes porn' question is obviously super old and problematic.
Also lexica.art is swarming with celebrity fantasy porn that just has a thin stylistic filter of paintings from the 19th century. And a plethora of furry daddies that you can't not love.
I get why these models should be curated but I also like that the sketchy porn possibilities keep them feeling un-padded / interesting / dangerous.
Then again this all is probably really dangerous so maybe that's silly.
Open source is more than just everything being available. It also depends on the license, and the one Stable Diffusion uses doesn't qualify, for multiple reasons, including the one mentioned upthread.
I think the controls in this space are such a shit show right now that being "open model" is practically equivalent to a WTFPL.
If you're trying to build an app based on SD, then not being open source matters. But seems like the majority of use cases are just "I want to run the model locally". And at that point HF can't stop me from just ripping the Wi-Fi card out of my computer.
The easiest way to combat this is to put your model behind an API and filter queries (midjourney, OpenAI) or just not make it available (Google). The tradeoff is that you're paying for everyone's compute.
I guess SD is betting that saving money on compute matters more in this space than the ability to gatekeep certain queries. And the tradeoff is that you need to do NSFW filtering in your released model.
It will be interesting to see who's right in 2 years.
Making a bulletproof filter is incredibly difficult, even more so in a domain where image descriptions are written by a culture that routinely circumvents text filters. Both Midjourney's and OpenAI's filters work mostly because of the threat of bans if you try to circumvent them. I'm not sure I would describe that as "the easy solution".
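A toy example of why naive query filtering leaks (entirely my own illustration, not anyone's production filter):

```python
# A naive blocklist catches the literal token but not trivial respellings,
# which prompt-writing communities adopt almost immediately.
BLOCKLIST = {"nude", "nsfw"}

def is_allowed(prompt: str) -> bool:
    return not any(tok in BLOCKLIST for tok in prompt.lower().split())

print(is_allowed("a nude figure study"))   # False: caught by the blocklist
print(is_allowed("a nud3 figure study"))   # True: trivially bypassed
```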
I can't see any progress on AMD/Intel GPU support :( Would love to see Vulkan or at least ROCm support. With SD1 you could follow some guides online to make it work, since PyTorch itself supports ROCm, but the state of non-Nvidia GPU support in the DL space is quite sad.
It works via PyTorch -> torch-mlir -> MLIR/IREE -> Vulkan, on both Windows and Linux. And it has a simple Gradio web UI https://github.com/nod-ai/SHARK/tree/main/web but we plan to enable better UI integrations very soon.
I dislike how they call their model open source even though there are restrictions on how you can use the model. The ability to use code however you want and not have to worry about if all the code you are using is compatible with your use case is a key part of open source.
I don't know why you're being downvoted. The model's license is unambiguously noncompliant with the Open Source Definition, yet they falsely claim it to be open source anyway. That's just as misleading as calling a product full of HFCS "sugar free" and saying it's okay because by "sugar", you just mean cane sugar.
The code is open source, the model is a data file that the open source code operates on. It's similar to engine recreations for old games (OpenRCT, OpenTTD) that use original, proprietary assets to play the games with their open source engines.
Similar to those games, anyone is also able to distribute their own open data files if they so wish
It's unlikely anyone will actually start training an open source AI model from scratch, because doing so costs insane amounts of money, but the same can be said about the many hours of work that recreating game assets takes for open source game engines.
I don't know what your point is. They use the terms "open source AI models" and "open source Generative AI models"
Yes, someone else could spend the millions of dollars to create a model that actually is open source, but shouldn't the people advertising their models as open source do that?
Should they distribute the data files according to the open source standards? Maybe. "Open" does not mean "open source", though; "open data" does not necessarily allow unlimited access and use of such data available, it's usually behind some kind of ToS document nobody reads and an API key. Applying open source expectations to anything with open in the name will often leave you disappointed outside the FOSS world.
Does not openly distributing their data files make their code any less open source? I don't think so. The code is open and licensed with a FOSS license. They spend time and money on creating a model and give the world the ability to replicate their model if it can collect the necessary funds. There are plenty of other open source projects that require vast arrays of server racks and compute power to be useful, that doesn't change anything about the openness of the code.
Awesome. I'm installing on Ubuntu 22.04 right now.
Ran into a few errors with the default instructions related to CUDA version mismatches with my nvidia driver. Now I'm trying without conda at all. Made a venv. I upgraded to the latest that Ubuntu provides and then downloaded and installed the appropriate CUDA from [1].
That got me farther. Then I ran into the fact that the xformers binaries from my earlier attempts are now incompatible with my current drivers and CUDA, so I'm rebuilding that one. I'm in the 30-minute compile, but I did the `pip install ninja` as recommended by [2] and it's running on a few of my 32 threads now. Ope! Done in 5 mins. Test info from `python -m xformers.info` looks good.
Damn, still hitting CUDA out-of-memory issues. I knew I should have bought a bigger GPU back in 2017. Everyone says I have to downgrade PyTorch to 1.12.1 for this not to happen. But oh dang, that was compiled with a different CUDA, oh groan. Maybe I should get conda to work after all.
`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 5.93 GiB total capacity; 5.62 GiB already allocated; 15.44 MiB free; 5.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
Guess I better go read those docs... to be continued.
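For anyone who hits the same message: the allocator hint it mentions is an environment variable, and it has to be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is just a starting guess):

```python
# Set PyTorch's CUDA allocator config before torch touches the GPU.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import (and any CUDA work) only after setting the variable
```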
GeForce GTX 1060 6GB, purchased literally 5 years ago. It worked with an optimized stable diffusion 1.0 so I was hopeful here. If I want to run these models going forward I guess I need something slightly more serious, eh?
It kind of annoys me that they removed NSFW images from the training set. Not because I want to generate porn (though some people do), but because I feel that they're foisting a puritan ethic on me. I don't consider the naked body inherently bad, and I don't like seeing new technology carry this (wrong, in my opinion) stigma.
Then again, it's their model, they can do whatever they want with it, but it still leaves me with a weird feeling.
But they aren't segregating them. They didn't release two models, one SFW and one NSFW. They segregated them before, with the filter you could disable, but now it's all SFW-only.
Eh, fine-tuning seems to work well enough that it can be added back in after.
Though, previous fine-tunings/textual inversions won't work, since the CLIP encoder has been replaced too. I'd be interested to know whether they need to be retrained for this case.
I've seen references to merging models together to be able to generate new kinds of imagery or styles; how does that work? I think you use DreamBooth to make specialized models, and I've got a rough idea of how that assigns a name to a vector in the latent space representing the thing you want to generate new imagery of, but can you generate multiple models and blend them together?
Edit: Looks like AUTOMATIC1111 can merge three checkpoints. I still don't know how it works technically, but I guess that's how it's done?
It’s my understanding that, amazingly enough, blending the models is done by literally performing a trivial linear blend of the raw numbers in the model files.
Someone even figured out they could get great compression of specialized model files by first subtracting the base model from the specialized model (using plain arithmetic) before zipping it. Of course, you need the same base file handy when you go to reverse the process.
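If that's right, the merge is just elementwise arithmetic on the checkpoints. A minimal sketch; the file names, the "state_dict" layout, and the 0.7/0.3 split are all my assumptions:

```python
# Hedged sketch: linearly blend two checkpoints' weight tensors.
import torch

a = torch.load("base_model.ckpt", map_location="cpu")["state_dict"]
b = torch.load("finetuned_model.ckpt", map_location="cpu")["state_dict"]

alpha = 0.7  # fraction of model A to keep
merged = {k: alpha * a[k] + (1 - alpha) * b[k] for k in a if k in b}

torch.save({"state_dict": merged}, "merged_model.ckpt")
```

The delta-compression trick mentioned above would be the same arithmetic in reverse: store `b[k] - a[k]`, which is mostly near-zero and compresses well, and add the base back to reconstruct.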
I thought so too, until I found that there's quite a bit of literature nowadays about "merging" weights; for example, this one: https://arxiv.org/pdf/1811.10515.pdf and also the OpenCLIP paper.
From the Netherlands, never heard of it. I've been living in Germany for four years now, and now that you remind me, I had to dig deep in my memory (at first I thought it might mean SS somehow), but yeah, someone once mentioned that H is the 8th letter of the alphabet, so 88 reads as HH, and if you associate HH with the Second World War you can read something into it. Most definitely not among the first associations for me, and believe me, we had enough WW2 material in school (including kids who use Hitler stuff to be funny). Perhaps 88 is specifically edgy in German high schools or so? I bet if you look at other cultures, it'll mean donkey balls or some such somewhere. I've also heard a German laugh about a 1312 license plate, which I'd never think to associate with alphabet offsets in my life; it would be "ieiz" or "lelz" for me, if anything.
TL;DR very far fetched and a bit pointless to go looking for these non-obvious alternative meanings, in my opinion
You're entirely correct but also missing the 70-80 years of continued 'neo-nazi' history that has followed in the US. 1488, odd number associations with 88, and patterns like this are all verboten in the US. I've also seen this tattooed on people in Berlin.
How is this free? Is it possible to download the checkpoints?
I'm asking because I'm running SD locally, but my GPU is not good enough to train new checkpoints, and until I find time to work on improving that, I wanted to use this API to generate some models for an illustration book I'm working on.
It's free because it's on my research cluster, not in the cloud, and I want to share it. Faster training will be paid, along with other features, but training a basic model will always be free. I'm adding download of checkpoints now.
What's the potential of using this for image restoration? I've been looking into this recently as I've found a ton of old family photos, that I'd like to digitize and repair some of the damage on them
There are a lot of tools available, but I haven't found anything where the result isn't just another kind of bad. So if the upscaling and inference in this model are good, it should in theory be possible to restore images by using the old photos as the seed, right?
The limit for this sort of exercise is "holding everything in memory", because training a neural network requires updating the weights frequently. An NVIDIA A100 has a memory bandwidth of about 2 TB/s; your home ADSL is something on the order of 10 Mbit/s. And then there's latency.
Mind you, theoretically that is a limitation of our current network architectures. If we could conceive a learning approach that was localised, to the point of being "embarrassingly parallel", perhaps. It would probably be less efficient, but if it is sufficiently parallel to compensate for Amdahl's law, who knows?
Less theoretically, one could imagine that we use the same approach that we use in systems engineering in general: functional decomposition. Instead of having one Huge Model To Rule Them All, train separate models that each perform a specific, modular function, and then integrate them.
In a sense, this is what is already happening. Stable Diffusion has one model for img2depth, to estimate which parts of a picture are far away from the lens. They have another model to upscale low-res images to high-res, etc. This is also how the brain works.
But it is difficult to see how this sort of approach could be applied to very small scale, low contextual tasks, like folding@home.
You would likely be limited by the communication latency between nodes, unless you come up with some unique model architecture or training method. Most of these large scale models are trained on GPUs using very high speed interconnects.
The term for this is federated learning. Usually it’s used to preserve privacy since a user’s data can stay on their device. I think it ends up not being efficient for the model sizes used here.
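For concreteness, federated averaging boils down to a size-weighted mean of locally trained weights, so only the weights (not the data) cross the slow link. A pure-Python sketch of the aggregation step, not a real trainer:

```python
# FedAvg aggregation: average client weights, weighted by local dataset size.
def fed_avg(client_weights, client_sizes):
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in keys
    }

# Two hypothetical clients with one-parameter "models":
print(fed_avg([{"w": 1.0}, {"w": 3.0}], client_sizes=[100, 300]))  # {'w': 2.5}
```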
Is there any place where we can learn more about all these AI tools that keep popping up, that is not marketing speak? Also, I see the words 'open' and 'open source' and yet they all require me to sign up to some service, join some beta program, buy credits etc. Are they open source?
> It is our pleasure to announce the open-source release of Stable Diffusion Version 2.[0]
> The original Stable Diffusion V1 led by CompVis changed the nature of open source AI models and spawned hundreds of other models and innovations all over the world. It had one of the fastest climbs to 10K Github stars of any software, rocketing through 33K stars in less than two months.
Looks good. I've gotten bored with AI image generation these days however, after using a lot of SD the past few months. I suppose that's the hedonic treadmill in action.
Newbie question, why can’t someone just take a pre-trained model/network with all the settings/weights/whatever and run it on a different configuration (at a heavily reduced speed)?
Isn't it like a Blender/3D Studio/AutoCAD file, where you can take the original 3D model and then render it using your own hardware? With my single GPU it will take days to raytrace a big scene, whereas someone with multiple higher-specced GPUs will need a few minutes.
It’s not totally clear what you are asking. The models are trained on something like an NVIDIA A100 which is a super high end machine learning processor, but inference can be run on a home GPU. So this is a “different configuration”.
But I think maybe you mean, can they make a model which normally needs a lot of RAM run more slowly on a machine that only has a little RAM?
It sounds like there are some tricks to use less RAM via specific algorithmic tweaks, so if a model normally needs 12GB of VRAM then, depending on the model, it may be possible to modify the algorithm to use half the RAM, for example. But I don't think it's the same as other rendering tasks, where you can use arbitrarily less compute and just run it longer.
The main limitation for running these AIs is that you need tons of VRAM available for your GPU to get any good performance out of them. I don't have a video card with 12GiB of VRAM and I don't know anyone who does.
If you're willing to wait more (30 seconds per image, assuming limited image sizes) there are repositories that will run the model on the CPU instead, leveraging your much cheaper RAM.
In theory you could swap VRAM in and out in the middle of the rendering process, but this would make the entire process incredibly slow. I think you'll have more success just running the CPU version if you're willing to accept slowdowns.
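Both tricks (reduced peak memory and CPU fallback) are exposed in the diffusers library. A hedged sketch; the checkpoint name and exact API are my assumptions and may have changed:

```python
# Trade speed for memory: attention slicing lowers peak usage; running on CPU
# removes the GPU requirement entirely (expect minutes per image, not seconds).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
pipe.enable_attention_slicing()  # compute attention in chunks to cut peak memory
pipe = pipe.to("cpu")            # or "cuda" if a big-enough card is available

image = pipe("a penguin playing golf", num_inference_steps=25).images[0]
image.save("penguin.png")
```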
Eyeing the price graph of that 3060, it might be "commonplace" among the population that built a gaming PC in the last couple months, or went all-out in the past ~1.5 years (availability not taken into account).
Most people I know don't have a desktop in the first place, and on average I wouldn't guess that desktop users build a new one more often than once every ~4 years. And that's among people who build their own; if you buy pre-built, you have to spend a lot extra to get those top of the line specs.
It's possible to now go out and buy this on a whim if you have a tech job or equivalent salary, though.
Unfortunately the 3060ti, 3070 and 3070ti are limited to 8GiB, so it is certainly not common.
In the price range it is the only Nvidia card with 12GiB and the 3080 starts at 10GiB.
So you can certainly get a 12GiB card without spending 3080+ money, but if you want any more power than a 3060 and keep the 12GiB then you would need to spring for a 3080 12GiB which is a big jump in price.
If you use the provided pytorch code, have a modern CPU and enough physical RAM, you can do this currently. As you suggest, inference/generation will take anywhere from hours to days using a CPU instead of a GPU or other ML-accelerator-chip.
“Adoption” is a generous term to use for a description of Github stars (referring to the first graph). There’s no denying stable diffusion has been gaining popularity, but I think it’s hard to say it’s really being adopted at the same rate it’s getting starred on Github.
Speaking of business models for AI, and the fact that Stable Diffusion is anti-trained for porn: somebody with an old terabyte image porn collection right now is going "Hold my beer, my time has come!"
One thing I'm wondering is what kinds of applications it can be used for. Maybe there will be new experiences in the fashion industry, like people training on their clothing designs and seeing how they look on people. Maybe they won't need to hire models to do the modelling?
I've also built a similar service[1] that does this with inpainting instead of textual inversion, so it preserves the face exactly and returns in seconds, not hours.
This is great. I built https://phantasmagoria.me because I was excited about this and wanted to make image generation more accessible. I can't wait to see what kinds of images v2 will enable.
Well darn. This is an awesome leap, but I've spent the last few months making a card game using Stable Diffusion art and I guess now I need to go back and go over everything again. Congratulations to the SD team on another wonderful step forward!
This one's publicly downloadable? I think I must've missed 1.5. It had been postponed for a while (for good reasons discussed throughout threads here) and I didn't notice whether it had been released.
depth2img looks really interesting. I was thinking that someone should train an art model like SD on 3d models+textures. This isn't quite that but it seems like it gets some of that effect.
The repo provides a chart with a quantitative measure (FID/CLIP score) where the new 2.0 models do indeed have much better results than the earlier 1.5 model.
I suspect that, if many of the people whining about copyrights in the context of generative AI got their way and made this usage a violation, they wouldn't be happy with the knock-on effects.
I would love that…I’ve heard of this use case for AI described many times but I’ve yet to find anyone doing it! Copilot is great, but for more creative/frontend work it seems like there should be something right?
Stable Diffusion is amazing at generating art. Something similar but specialized in UI could be too. Maybe one could make a custom model, but with my lack of design knowledge I’m not even sure where to start…
It would surely save my monkey brain from pouring many more hours into looking at existing websites/UI libraries/Dribbble and drawing inspiration (copying) from them.
Thanks for your premature concern, but we'll be fine. Despite how it may appear to a layperson such as yourself, the value of human creativity is in no way diminished by the release of this tool or others like it.
Ok. The irony. Actually, after 20+ years in the tech industry, I will say this:
Your beloved corporations don't have a metric called "creativity"; they have a bottom line, and it has all the power.
I am an artist by education and can confirm that creativity is overrated, the processes that artist follows and repetition towards a given goal deliver the results.
Whatever feelings or ideas you have, the actual craft is the medium in which you will deliver.
Reducing *The Path* to text input is not an artistic or craftsmanship process.
There is no creativity involved. Maybe someone with more knowledge of the real process and a broader visual culture will make more aesthetically right choices. But this can be automated too.
This is not a “tool”, like Photoshop. This is something else.
And all of you know this.
More than 50 percent of frontend code is boilerplate. CRUD apps follow similar logic.
Why not automate these repetitive processes first?
No. Corporations are starting the automation with the lowest-risk crowd: the digital artists. They have low representation, no coherent community, and are always ready to sell themselves for pennies.
Now they will compete with the machines.
And your time in this battle will come. Soon.