> I’ve always had difficulty generating a normal Sonic the Hedgehog image with AI image generation. DALL-E 2, for example, just flat-out can’t do it.
IIUC, OpenAI specifically deny-listed trademarked characters from the training dataset to try and side-step getting sued by companies with enough resources to move the needle on what was allowed and disallowed in terms of digital art creation (since a ruling could disrupt their entire business model).
I assume Emad Mostaque did the risk calculus differently when he open-sourced Stable Diffusion (I don't know for sure, but it smells like his attitude on the question was "I don't care, because now that it's open-source nobody can delete it from the Internet anyway").
If Sonic the Hedgehog was specifically excluded from the dataset, DALL-E 2 has no way to know the hedgehog should be blue. Instead, I think it's interpreting "a portrait of Sonic the Hedgehog" as a hedgehog that happens to have a name, and almost all hedgehogs, regardless of their name, are the same color.
Even if someone had added a blue realistic-looking hedgehog to the dataset, if they referenced Sonic their image would have been deny-listed. So the only way to get there is to add the adjectives one is looking for, such as `a portrait of a blue hedgehog wearing red sneakers and eating a chili dog`... And, indeed, when I try that prompt, I definitely get some Sonic-alikes.
Anyone else getting bored of generating images? I used Stable Diffusion even before the public release (there was a guide on 4chan with leaked weights), so I've been using it for a few weeks now. I feel like I've generated all the images I want to for now, and I don't have any more interesting concepts I want to explore. With so much variety in images, I got acclimatized to them and they all became the same and uninteresting after a while.
It's like a hedonic treadmill, but for AI image generation. I assume this will have the most usage as a once-in-a-while tool for artists to get inspiration from, and as a tool for commercial use, such as a Photoshop or Figma plugin for designers. The lay person who wants to generate images will get bored after a while.
The first human who discovered that they could draw something on a wall probably also got bored after a while.
And thought "I assume this will have the most usage as a once-in-a-while tool to decorate my cave with funny animals".
But then drawing animals turned into writing. Turned into printing. Turned into emails and the web and the web 2.0, and now here we are, sending our thoughts across the world at the speed of light. Using software that we wrote to tell computers how to do this for us.
It might be similar with AI-driven image generation. It may evolve in ways that are currently hard to foresee. Maybe one day our thoughts are translated in real time into images and sent across the galaxy at the speed of light. Or faster?
How cool will it be when we can tell a program, "I want a fantasy story that continues The Lord of the Rings and details Pippin and Merry's second great and wonderful adventure after Frodo left Middle-earth", and get a 300-page story that is well-written and inventive and fun to read.
I know the requirements for something like that are extraordinarily high, but I could see it in 30-40 years being viable.
> With so much variety in images, I got acclimatized to them and they all became the same and uninteresting after a while.
IMHO, the results of these image generators also tend to be pretty mediocre. Not terrible, but like those Beeple NFT images: something made by someone of middling talent without inspiration, mainly as an excuse to use tools.
Also, when I played with stable diffusion specifically, the stuff it generated frequently had a horror-show quality, because it has no idea about stuff like how many legs people have.
I wonder if these generators will plateau, because the "throw more training data at it" technique will be undermined by mediocre-to-poor AI generated images.
I'm watching this space like a hawk. Though I code for a living, this work is too different from mine: I can't personally do the stuff I need with it yet, but that stuff is continually happening. (If I weren't on an M1 Mac I might be a little farther along, but I think I'd still be running into obstacles given how technical all this is.)
I've put out an album pretty recently, based on my collaboration with a modular synthesizer running systems that let it aleatorically generate chords, key changes, etc., and you probably would not know the 'machine' wrote the chords for the whole album.
The place where Stable Diffusion becomes interesting for me is when I can feed my own styles and objects into it. Right now, I don't think that's facilitated for me on my M1 Mac in my relatively nontechnical world: I can run DiffusionBee, and explore the ranges of the dataset's collective visual unconscious. I'm learning how to do much more interesting things than 'trending on Artstation, 8k, etc etc etc'.
I have curated collections of images to which I can apply my own language cues, and a 440 episode webcomic that I could annotate the hell out of (my OWN art, not currently even on the internet), and the ability to feed not just 4-5 images but dozens, hundreds of images into the machine.
At that point, it's a private visual imagination that becomes ME diffusion, and I don't have to tell it 'trending on artstation' anymore. Hell, I could teach my copy a whole set of associations based on just feeding its own output back into it, using my own intuition to associate not-generally-useful concepts like 'cold' or 'loud' or 'disappointed' considered as visual abstractions.
If I can reliably associate 'anticipatory' with visual stimuli, I can begin using it as direction for my own use of SD, telling it that I want this panel more 'anticipatory' or less. If I can feed in a language of panels and borders that has variety and I'm able to associate it with language, I can use SD to generate comic panels with the associations 'unsettling' or 'normal' or 'dramatic' and composite them into final output.
Bear in mind that engaging in this behavior means ME feeding in the associations that are relevant to ME as an artist, effectively making an auxiliary visual subconscious much like I made a modular synth into an auxiliary musical composition subconscious.
No, I'm not bored. If you're bored, maybe you're not an artist? Or maybe you don't have a firm intention and motivation towards which to direct your art?
These things should be like a violin. You can give one to any shmoe off the street, but ability to perform on the thing does not come along with just picking it up and plunking at it. I'm convinced that in order to make visual AI a tool you absolutely must let the artist feed in their own associations, concepts, objects etc. and then direct the output towards their own ends.
I completely agree with your sentiments and enthusiasm.
Personally, I've been waiting a long time for AI to get to the point where I can generate my own animated cartoons. I can see the pipeline for it now: sketch concept art, mess around with inpainting, use textual inversion to build a "dictionary" of assets like character art, scenes, objects, and art styles, use that plus simple sketches to storyboard, then put the script in and get keyframes, then tween.
And even if the result looks mediocre or a bit weird (for now), the sheer power of solo-animating an entire series is just... tantalizing.
Stable Diffusion is wild - the space has been quickly developing and watching the pace of development makes me reconsider what I consider "staggering". I've been blown away. The accessibility of this technology is even more incredible - there's even a fork that is working on M1 Macs (https://github.com/lstein/stable-diffusion)
We are in for some interesting times. Whatever the next iteration of Textual Inversion is will be extremely disruptive, especially if the concepts continue to be developed collectively.
Yeah, textual inversion is amazing paired with SD.
I recently trained it on NFTs to generate variants of Bored Ape and punk-style art.
The only problem right now is that the variants are not consistent, and it's hard to tell Stable Diffusion to make only slight variations using some masking and editing.
You can mask out the top of the punk's head and Stable Diffusion will generate different heads, but that's fairly limited in the end result.
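For anyone curious, the masking step looks roughly like this with diffusers' inpainting pipeline. This is only a sketch: the checkpoint name, the prompt, and the punk.png / head_mask.png files are placeholders, not exactly what I ran.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load an inpainting checkpoint in half precision
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# White pixels in the mask mark the region to regenerate (here: the head)
init_image = Image.open("punk.png").convert("RGB").resize((512, 512))
mask_image = Image.open("head_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="pixel art punk character with a mohawk, flat colors",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("punk_variant.png")

The catch is exactly what I described: the regenerated region rarely stays stylistically consistent with the rest of the image.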
I think a cloud service which can automate the training and store textual inversion models would be a really cool startup. Ping me if you want to build this together.
I've been experimenting with it to spit out commercial-style illustrations and stock photos. It's a lot of manual work with Google Colab and frustrating to try.
It turned out to be a simple way to communicate to the model "I want this to be similar to high quality images only, instead of all kinds of shitty art you've seen".
Is there an online service where I could use the features of Stable Diffusion to create images? I don't think my GTX 970 meets the specs needed, but I am a total noob and not sure. I have been dying to try it out, though, and would like to find a way to make a few images. If anyone knows of any services I would be thankful for a link. Thanks!
How many images do you need to get a good textual inversion? If you were to draw a custom character in a few poses, could you then generate infinite new styles and poses of that character? Or does it have to be something already common in the training dataset that you are just putting a very specific noun to?
If someone played around with textual inversion: what happens if I give it hundreds of images? Does it plateau, does it regress, and if it regresses, what is the issue exactly?
Good stuff, Max. Wish I had time to fiddle with local SD but I'm not worried about it, it just keeps getting better and more accessible. The leading edge will always be experimenters but the trailing edge will be creative people using the tools to further their art.
Gonna have to pick up one of those cheap RTX 3090s that miners are getting rid of so I can try this myself. With 8GB I'm just getting out of memory errors.
You can 100% run this on an 8GB card. Make sure to load the model weights in half precision (and optionally enable attention slicing). E.g., using the diffusers library [1]:
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

# Load the weights in half precision so the model fits in ~8GB of VRAM
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # trades a little speed for a lower memory peak

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
That fork came out around when SD was released. I've been wondering if anyone's integrated those optimizations into some of the more prominent forks that are floating around.
Does it work with textual inversion? I've got Stable Diffusion set up and it works fine, but I get an out-of-memory error on the textual inversion feature.
No, not currently. Textual inversion requires training the model, which means gradients have to be computed and stored in VRAM (that doesn't happen for normal inference), so more VRAM than usual is required.
Textual inversion is teaching it a new concept, such as "Ugly Sonic", from a couple of images. You then get an embedding file you can load into SD to use the new concept in prompts.
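With a newer diffusers release, using such a file looks something like this. Just a sketch: the `ugly-sonic.bin` file and the `<ugly-sonic>` token are placeholders for whatever your training run actually produced.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Load the learned embedding and bind it to a placeholder token
pipe.load_textual_inversion("ugly-sonic.bin", token="<ugly-sonic>")

# The new token can now be used like any other word in a prompt
image = pipe("a portrait of <ugly-sonic> in a business suit").images[0]
image.save("ugly_sonic_suit.png")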
I was hoping an online service would pop up where you can use someone else's hardware for a small fee and generate images. I really want to give it a try but only have a GTX 970, so I don't think it will work.
> I was hoping an online service would pop up where you can use someone else's hardware for a small fee and generate images.
This one charges but gives pretty good results at 512x512 - and in only a couple of seconds at that resolution. For more logo/cartoony stuff you can generally get a 10 step done in less than a second.
When I signed up a few weeks ago they gave 200 credits free (IIRC). After that it is $10 for another 1,000 credits (a credit gives 5 images at 10 step 512x512 or 1 image at 50 step 512x512).
Only thing to note is that after a while it slows down, but I realised that is because every image generated is going into an array in local storage in the browser. Using the browser's Inspect -> Application area to clear that array every 100 images or so sorted that out.
Cloud providers like Vultr, Digital Ocean, Linode, etc., have servers with GPUs you can spin up. You pay for uptime and for storage (including for machines that are not up).
Got the Gradio web UI running (unstably, thanks to a memory leak) on my GTX 1050 Ti 4GB; you /can/ do it! Be sure to check out the "optimizedSD" repo if you follow this up.
Just an aside: if you want to see Ugly Sonic resurrected, just watch the new Chip 'n Dale movie... It's actually an alright film, and the fact that they put Ugly Sonic in it was pretty amazing.
> Indeed, there are many images of Sonic in the training dataset, however the generated images do not verbatim reproduce or otherwise plagiarize results from the training set above (I checked each one).
This understanding of plagiarism reminds me of what students tell me when I ask them why they thought they could paraphrase an uncited Wikipedia article as their paper, right before I report them to the dean's office.
Plagiarism in art is more complicated than plagiarism in a thesis paper, especially since art is a creative field defined by inspiration and iteration.
I added that line because an AI reproducing its training data verbatim is a valid concern (such as reproducing a Getty Images watermark), but that isn't happening here.
I'm eager to see if "AI" can break out of these local minima and actually live up to its promise, or if the end game is really just making cool-looking pictures that get fleeting internet attention.
I've always been bothered by app names getting invaded by keyword vomit.
"Firefox, fast and free"
It feels like going to get a movie, and the title is "Casablanca, super attractive lead actor" or something. It just cheapens the value of the work like it's a low budget film from the adult section.
But what weirds me out is how much of that is needed to get a desired result. "Unreal engine 4k resolution".
What happens when 4k isn't enough? What about twenty years later when all our buzzwords are meaningless, and unreal engine no longer exists? Or what if AI appropriates the word, and that becomes the new definition?
It hasn't got anything to do with the keyword's actual meaning, really... all of this is just a way to map the 7000-dimensional internal description space in SD to something we meagre humans can input and output. You can leave it free, but then you have to accept the result can be anything. Normally you do want some control over the process.
The keyword "4k" doesn't "mean" anything to the model; it maps into the internal space through the language model, depending on how that was trained on captions earlier. You could just as well have used any other way of assigning some parts of this 7000-dimensional vector. It just turns out these keywords are usable shortcuts, just as specifying a camera model or an artist name in essence acts as a "macro" into the internal space.
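You can see this concretely by running a prompt through the CLIP text encoder that SD v1.x conditions on. A sketch assuming the `transformers` library and the `openai/clip-vit-large-patch14` checkpoint; "4k" becomes tokens mapped into the same space as every other word:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed(prompt):
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tokens.input_ids)[0]  # per-token embeddings

a = embed("a castle, 4k, unreal engine")
b = embed("a castle")
print(a.shape)  # the whole prompt becomes a fixed-size grid of vectors
print(torch.nn.functional.cosine_similarity(a.flatten(), b.flatten(), dim=0))

The keyword only nudges where the prompt lands in that space; the model itself never "renders at 4k".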
I'm reminded of the HN commenter who said that their 3-year-old daughter used "don't forget to like and subscribe!" as a form of goodbye, e.g. when relatives left the house.
Right, right. It just seems like something that could push language in a dystopian direction.
If 4k doesn't really "mean" anything to the algorithm, will it still have meaning for us? Or will our usage of the word start to reflect how the algorithm interprets it?
Just like YouTube face. It starts out as us influencing the algorithms, it ends with the algorithms influencing us.
No, because these style prompts are just dumb interfaces to a model. They're very stupid and we will not be using them like this in the (likely near) future. There are many other ways one could query a model for stylistic modifications.
Yeah, there is already textual inversion, and other ways to "prompt" the models are coming up. I do think plain human-level text will stay (and evolve) as one way, though. Stability AI already released an updated CLIP model (the language model used in many of these), for example. Also, the language model in the current SD was intentionally kept simple so the whole system could fit in consumer GPU VRAM; it was not state of the art.
There might be some interesting statistical analyses possible at the end of the day btw, like trying to figure out how many commonly widespread "artistic styles" there are in the world's total of art and photography so far. After all that is what a lot of the prompt engineering is about. Sort of a principal component analysis of the dominant 100 styles or something...
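A toy version of that analysis: embed a pile of style phrases with the same CLIP text encoder and run PCA over them. Purely illustrative; the style list here is a tiny placeholder and a real analysis would scrape thousands of prompts.

import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tiny placeholder list of style phrases
styles = ["oil painting", "pixel art", "studio photography",
          "ukiyo-e print", "trending on artstation", "charcoal sketch"]

def embed(text):
    tokens = tokenizer(text, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the per-token embeddings into one vector per phrase
        return text_encoder(tokens.input_ids)[0].mean(dim=1).squeeze().numpy()

X = np.stack([embed(s) for s in styles])
pca = PCA(n_components=3)
coords = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # how much variance the top "style axes" explain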
I wouldn't worry about it. You could try asking a random person in a bar what they think "4k" means, I'd bet a decent portion of people would say something along the lines of "high quality", and very few would give you the precise pixel count.
4k doesn’t have a precise pixel count. If you’re bringing precision into the picture, then the reality is that 4k is a family of resolutions around 4000 pixels wide. UHD has a precisely defined size.
It doesn’t have a precise meaning to this model though - because everything it trained on was downscaled anyway, and because an image on the internet with “UHD” in the caption wasn’t necessarily UHD.
"Trending on artstation" is even more strange. I do think these text prompts will outlive their original context; like the floppy disk "save" icon or the "hang up" icon.
The generated images are rarely larger than 512x512 pixels. The prompt says "4k" to bias the style of generated images toward the kind that are labelled 4k in the input data set.
For now, we use prompts to explore latent space. I don't know that we'll always use prompts. They're better than trying to carefully twiddle hundreds of dimensions manually, but I can't help but think there's a better, more tunable approach yet to be invented.
There are ways to nudge out dimensions with specific meaning (or create dimensions with specific meaning, since you can have an exactly equivalent space under some linear transform). I recall seeing research on adjusting face-generation models to have explicit sliders for certain categories like length of hair, masculinity, shape of nose, etc., so it should be reasonably straightforward (as in, requiring a bunch of work and tuning but no breakthroughs) to have specific, explicitly tunable "high-res vs low-res" or "detailed vs sketch" or "photorealistic vs drawn" parameters in addition to the prompt.
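In sketch form, that kind of slider is just movement along a learned direction in latent space. Illustrative only: the `direction` below is random stand-in data rather than an actual learned axis.

import torch

latent = torch.randn(4, 64, 64)           # a Stable Diffusion-sized latent
direction = torch.randn(4, 64, 64)        # stand-in for a learned "style axis"
direction = direction / direction.norm()  # normalize to unit length

def slide(latent, direction, amount):
    # `amount` acts as the slider: negative pushes toward one end of the
    # axis, positive toward the other
    return latent + amount * direction

more_detailed = slide(latent, direction, 3.0)
less_detailed = slide(latent, direction, -3.0)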
There may be something much better, but in many places it's difficult to come up with a form of communication more efficient for the user than textual language. It could be a very tall order to do that in this area.
The best way to think of prompts is as somewhat equivalent to search engine queries. Perhaps the most fascinating part of these transformer models is how they resemble the process of memory recall and association.
Into a roughly 4GB space are embedded potentially millions of images, mapped to tokens that correlate with language descriptions. So describing the desired result somewhat simulates this process of recall and association.
The process lacks any internal emotional motivation to construct a "desired" outcome the way a human would, so it needs both a "seed" number and a description of the desired recall elements to achieve a specific result.
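That's why, with diffusers for example, you typically pin the seed alongside the prompt if you want a repeatable result. A rough sketch; the model ID and prompt are just examples:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a portrait of a blue hedgehog wearing red sneakers"
# Fixing the seed pins down the initial noise, so the same prompt + seed
# reproduces the same image; change either and the "recall" changes.
generator = torch.Generator("cuda").manual_seed(1234)
image = pipe(prompt, generator=generator).images[0]
image.save("hedgehog.png")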
> Or what if AI appropriates the word, and that becomes the new definition?
Or what if it appropriates your name as a style, and puts you out of business?
It's like the old prediction that well-known actors could sell their faces and voices for producers to CGI up new films with, except you don't pay the actor.