A Web UI for Stable Diffusion (github.com/automatic1111)
284 points by feross on Sept 9, 2022 | 143 comments



This is the one I've been using https://github.com/sd-webui/stable-diffusion-webui . docker-compose up , works great.


I've also been using this one (I wasn't sure about it initially, since it just migrated from the /hlky/ namespace on GitHub), but I have no idea at first glance what the differences are.

I will say that this one has had REALLY active development as new features have been coming out, and is pretty polished at this point (although I'm using it more as a toy than anything, it's awesome to have a quick way to try the new features that have been shipping).


Seems like it fails if you have an AMD GPU instead of an Nvidia one (at least that's my guess, based on the error contents):

  ERROR: for stable-diffusion  Cannot start service stable-diffusion: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown


Indeed, Stable Diffusion does not currently run on AMD graphics cards.


I'm able to run it on my rx6900 xt on Linux.

I tried using a Docker container and it took 3 minutes to generate an image from a prompt. However, it seems that about 2:45 of that is somehow spent before the GPU is touched, and only in the remaining 15 seconds does the GPU actually get utilized.

I haven't had the time to look into this yet, but it does seem to work.


Some people have been able to run on recent AMD cards with rocm: https://github.com/CompVis/stable-diffusion/issues/48


It does, on Linux, Windows, and MacOS.


I've gotten it running with a Radeon RX 6800 on Ubuntu Linux 22.04 (by overwriting PyTorch with a ROCm-supporting build), and on Windows 10 (in a very barebones way using ONNX), but are there better, more full-featured ways to get it running on Windows? Would love to know.


Is there a way to run this in the cloud? On Google Colab or elsewhere?


Colab: https://colab.research.google.com/github/WASasquatch/StableD...

To run it elsewhere in the cloud, grab a GPU (spot) instance and SSH in.


You mean a VM on a machine with a GPU? Or does it have to be a bare metal machine? What is a good provider of suitable VMs/machines?

And what do you do after you've SSHed in? The installation instructions seem to be for Windows users (click here, then click there ...). Is there a Linux script that does the installation automatically?


Just a VM with a GPU; it doesn't need to be bare metal. AWS/GCP/Azure have them, but for GPU cloud instances, boutique vendors like CoreWeave, runpod, lambdalabs.com, vast.ai and paperspace may be more competitive.

Parent comment alludes to docker-compose up.


You can run docker inside a VM? And it will be able to use the GPU and tunnel port 80 through the docker container, through the VM and to the web?


There's some trickery and details, but yes: https://github.com/NVIDIA/nvidia-docker


You can also get Windows instances with a GPU attached and RDP in, though personally I prefer Linux.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/accel...


Here's an install script if you need handholding :) https://github.com/JoshuaKimsey/Linux-StableDiffusion-Script...


Lambda Labs is probably cheaper: https://lambdalabs.com/service/gpu-cloud


I like this one but had some trouble with using img2img. Maybe my image was too small (it was smaller than 512x512). Failed with the same signature as an issue that was closed with a fix.


I'm on mobile, but there's an issue reported on GitHub about img2img.


So, and this is an ELI5 kind of question I suppose. There must be something going on like "processing a kazillion images" and I'm trying to wrap my head around how (or what part of) that work is "offloaded" to your home computer/graphics card? I just can't seem to make sense of how you can do it at home if you're not somehow in direct contact with "all the data?" e.g. must you be connected to the internet, or "stable-diffusions servers" for this to work?


You can think of it more like this: if I do 100 experiments of dropping stones from variable heights and measuring the time it takes each stone to hit the ground, I have enough datapoints to estimate gravity by regression (the relation is t = sqrt(2h/g), so fitting t^2 against h gives g). So based on my data I create a model: the time it takes for a stone to fall from height h is sqrt(2h/9.81). Now if you want to figure out how long it takes for your stones to fall, you don't need to redo all the experiments and can instead rely on the parameter I give you (9.81 in this case) to calculate it yourself.

With these models it works exactly the same way. Someone dropped millions of rocks, created a formula of unbelievable complexity, and then released that formula with all of its calculated parameters into the world. When you ultimately use Stable Diffusion, you just calculate the result of this formula, and that is your image. You never have to process those training images yourself.
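
A minimal sketch of the falling-stone analogy (with made-up measurements), just to make the "fit once, then reuse only the parameter" idea concrete:

  import numpy as np

  # Made-up drop experiments: heights (m) and measured fall times (s)
  h = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
  t = np.array([0.45, 0.64, 1.01, 1.43, 2.02])

  # t = sqrt(2h/g)  =>  t^2 = (2/g) * h, so fitting t^2 against h gives g
  slope = np.polyfit(h, t**2, 1)[0]
  g = 2.0 / slope
  print(g)  # ~9.8

  # Anyone who receives only the fitted parameter can predict new fall times
  # without redoing a single experiment:
  print((2 * 7.0 / g) ** 0.5)  # predicted time to fall 7 m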


This is exactly it. It’s pretty remarkable that it was trained on over 100 terabytes of images and yet the model has been distilled down to only 4gb.


Yes, and another reason for the small model size, and the novelty of the underlying paper [1], is that the diffusion model is not acting on pixel space but rather on a latent space. This means that this 'latent diffusion model' not only learns the task at hand (image synthesis) but in parallel also learns a powerful lossy compression model via an outer autoencoder structure. The number of weights (model size) can then be reduced drastically, because the inner neural network layers act on a lower-dimensional latent space rather than the high-dimensional pixel space. It's fascinating because it shows that deep learning at its core comes down to compression/decompression (encoding/decoding), with a close relation to Shannon's information theory (e.g. source coding/channel coding/the data processing inequality).

[1] https://arxiv.org/abs/2112.10752
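
A rough sketch of that point using the diffusers library (model name and API as of the v1.4 release; treat the details as approximate): a 512x512 RGB image gets squeezed into a 4x64x64 latent, and the diffusion U-Net only ever works on the latter.

  import torch
  from diffusers import AutoencoderKL

  # The VAE ("outer autoencoder") that maps between pixel space and latent space
  vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

  image = torch.randn(1, 3, 512, 512)               # stand-in for a real image tensor
  latents = vae.encode(image).latent_dist.sample()  # shape: (1, 4, 64, 64)
  print(image.numel() / latents.numel())            # ~48x fewer numbers for the U-Net to handle

  decoded = vae.decode(latents).sample              # back to (1, 3, 512, 512)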


Oh, wow. Now that you mention how it's similar to lossy (if not the same as) compression it all makes a LOT of sense. This is great. I teach IT and I already do a bit on how lossy compression works, (e.g. hey, if you see a blue pixel and then another slightly darker one next to it, what's the NEXT likely to be?) and this is something of an extension of that.


Correction: the auto encoder is pre-trained :)


Then maybe we should keep this 25,000:1 ratio in mind when an artist complains about their copyright being abused. The model doesn't have the space to actually copy their works; it can only memorise the equivalent of a thumbnail of each input. A very small thumbnail, scaled down roughly 160:1 in width and height (the square root of 25,000). That's like a grain of rice on the screen.


That's not how it works though. Instead of applying arbitrary content detail reduction, the model is an attempt to distill the core of what makes a particular artist (or phrase, face, object etc) unique.

When programming, it will often take a long time and a lot of code to get to a few final lines that do what you want. You cannot say the final result is a "thumbnail" of all previous efforts. Rather, it is the apotheosis of it.

Some artists spend decades developing a style that looks like a kid could do it as well. Still, there is something unique in there, that a trained eye will recognize. Converting that particular style to a formula and making that freely available is at least somewhat morally ambiguous.


It's the same as someone trying to mimic a style. Nothing wrong with that. Certainly not something you could get copyrights from.


It is not the same as trying to mimic a style. It is cloning the essence of a style and making it readily available to anyone who asks for it.

Sure, it's not copyright infringement, but you could argue that this takes away from the hardship the original artist had to go through to perfect their style.


> Sure, it's not copyright infringement

Which was precisely what the discussion was about.

> But you could argue that this takes away from the hardship the original artist had to go through to perfect their style

You could argue the same things about photoshop, a lot of other digital tools, drum machines, the photograph and the phonograph.


Ah, I can step in here.

Fair use might work, but maybe not? If I were to argue against it, I'd probably compare it to something like a recording of music vs. a MIDI file: a similar ratio of raw data to compact representation.


That’s the interesting part: all the images generated are derived from a less than 4gb model (the trained weights of the neural network).

So in a way, hundreds of billions of possible images are all stored in the model (each a vector in a multidimensional latent space) and turned into pixels on demand (driven by the language model that knows how to turn words into a vector in this space).

As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image) it’s a form of compression (or at least encoding decoding) too: I could send you the parameters for 1 million images that you would be able to recreate on your side, just as a relatively small text file.
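
A hedged sketch of that "send the parameters, not the pixels" idea with the diffusers library (repo name and defaults assumed): the prompt, seed, step count and guidance scale are a few dozen bytes, and anyone running the same weights should get back essentially the same 512x512 image. (The caveat about bit-identical results across different hardware is discussed in the replies below.)

  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

  # This dict is the entire "compressed file" for one image
  params = {"prompt": "a watercolor painting of a fox", "seed": 1234,
            "steps": 50, "cfg": 7.5}

  generator = torch.Generator("cuda").manual_seed(params["seed"])
  image = pipe(params["prompt"], num_inference_steps=params["steps"],
               guidance_scale=params["cfg"], generator=generator).images[0]
  image.save("fox.png")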


So it's like a compiler which produces a 4GB executable file? And that 4GB is all the "logic" which can produce infinite possible images?


Not exactly. There’s no real logic per se, just data. It’s made up of tons of floating point numbers that define relationships to other floating point numbers.


> As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image) it’s a form of compression (or at least encoding decoding) too: I could send you the parameters for 1 million images that you would be able to recreate on your side, just as a relatively small text file.

For any input image? Or do you mean an image generated by the model?


I meant images generated by the model. Now that I think of it I could just send you the sampled vectors and you could feed that to the vector to image part.


My understanding is that images will not be bit-identical due to GPU floating-point behaviour and limited precision. Images from the same seed may be, for all practical intents and purposes, indistinguishable, but there are some flipped bits involved.


That's not my understanding. The same seed value to the device's random number generator should result in the exact same outputs; there's a bug being chased down in the MPS (macOS) backend where a fixed random seed doesn't output the same image on different computers.


I've heard something a bit in between what you're both saying. For the same machine with the same seed / parameters [0], your output is deterministic. But once you change hardware or OS you will probably get bit-level differences that won't make a macro-level difference.

No idea how true that is, but on my windows machine, same params/seed is definitely deterministic.

[0] a help string in the SD source code recommends that the ddim_eta parameter (which isn't exposed in most web UIs or GUIs, including the OP's) stay at the default 0.8 for deterministic sampling. I have no idea if this means changing the value from 0.8 produces non-deterministic results with the same hardware/OS/params/seed, or if they just mean changing it from 0.8 will make your SD not match the online model while still being deterministic itself. But in my testing, changing this value gives no useful changes to the image generation, so I keep it at 0.8.


If floats are used then there is no absolutely deterministic behaviour across different machines. It can never be guaranteed.


You could input an image and get it to recreate it as best as possible and then output a seed. That would be interesting!


This is a fascinating idea. Have StableDiffusion generate an image from the image you'd like to "compress" + a random seed. Feed that output to an adversarial network that compares source image to output and scores it. Try again with new seed.

After running for a while, the adversarial network outputs a seed, and you now have a few characters representing a reasonable approximation of your image.


I expect something after JPEG XL will be a neural-network-based compression scheme, where the client has an n-GB neural net attached. There have been several that already show promising results (it's likely to be more of a standards issue than a technical issue).


In the '80s there was a man (I forget his name) who claimed that one day you could store an entire high-res movie on a floppy disc. One day he might be right, when AI can regenerate sequences of images/video from seeds. You just need a petabyte of models stored somewhere.


This is the main reason why attempts to say that these sorts of AI are just glorified lookup tables, or even that they are simply tools that mash a kazillion images together, are very misleading.

A kazillion images are used in training, but training consists of using those images to tune on the order of ~5 GB of weights and that is the entire size of the final model. Those images are never stored anywhere else and are discarded immediately after being used to tune the model. Those 5 GB generate all the images we see.


All those 'kazillion' images are processed into a single 'model'. Similar to how our brain cannot remember 100% of all our experiences, this model will not store precise copies of all the images it is trained on. However, it will understand concepts, such as what a unicorn looks like.

For StableDiffusion, the current model is ~4GB, which is downloaded the first time you run the model. These 4GB encode all the information that the model requires to derive your images.


SD has 860M weights for the main workhorse part. At 16-bit precision that is only 1.6 GB of data, which in some very real sense has condensed the world's total knowledge of art and photography and styles and objects.

It's not a search engine, it's self-contained and the closest analogy is that it's a very very knowledgable and skilled artist.
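
If you want to check those numbers yourself, here's a short diffusers-based sketch (model name assumed; this counts only the U-Net, not the VAE or text encoder):

  from diffusers import UNet2DConditionModel

  unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
  n = sum(p.numel() for p in unet.parameters())
  print(f"{n / 1e6:.0f}M parameters")     # ~860M
  print(f"{n * 2 / 1e9:.1f} GB at fp16")  # 2 bytes per weight, roughly 1.7 GB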


Is there a smaller version of the model available (<4gb) intended for use with 16 bit precision?


Diffusers shows how to use the fp16 variant.

https://github.com/huggingface/diffusers
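
For reference, loading the half-precision checkpoint looks roughly like this (the revision="fp16" branch and model name are what the diffusers docs showed at the time; check the current docs before relying on it):

  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "CompVis/stable-diffusion-v1-4",
      revision="fp16",              # half-precision weights, ~2 GB instead of ~4 GB
      torch_dtype=torch.float16,
  ).to("cuda")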


What you interact with as the user is the model and its weights.

The model (presumably some kind of convolutional neural network) has many layers, every layer has some set of nodes, and every node has weights, which are just coefficients. The weights are 'learned' during model training, where the model takes in the data you mention and its output is evaluated. This typically happens on a super beefy computer and can take a long time for a model like this. As images are evaluated, the weights get adjusted and the output gets better.

Now we as the user just need the model and the weights!
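
A toy sketch of what "just the model and the weights" means in practice (the layer shapes here are made up): inference is nothing more than multiplying stored coefficients against your input.

  import torch

  # Hypothetical stand-ins for learned coefficients; a real model ships these in a ~4 GB file
  weights = {"layer1": torch.randn(768, 256), "layer2": torch.randn(256, 10)}

  def forward(x):
      # Apply the stored coefficients layer by layer; no training data needed at this point
      h = torch.relu(x @ weights["layer1"])
      return h @ weights["layer2"]

  print(forward(torch.randn(1, 768)).shape)  # torch.Size([1, 10])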


It's all offline in a 4 GB file on your local computer. It's like a mini brain trained to do just one or a few specific tasks. Just like your own brain doesn't need Wi-Fi to connect to some global memory storage of everything you've experienced since birth, this 4 GB file doesn't need anything extra.


A kazillion images are used to create/optimize a neural network (basically). What you're working with is the result of that training: the "weights".


As someone with ~0 knowledge in this field, I think this has to do with a concept called "transfer learning", in which you train once with that kazillion of images, then use those same "coefficients" for further runs of the NN.


Nah, transfer learning is when you take a trained model, and train it a little more to better fit your (potentially very different) problem domain. Such as training a cat/dog/etc recognition model on MRI scans.

The goal is usually to have the more fundamental parts of your model already working and you thus need way less domain specific data.

Here, you're not training anything, you're running the models (both the CLIP language model and the unet) in feedforward. That's just deploying your model, not transfer learning.


It can be run directly in Google Colab: https://colab.research.google.com/drive/1Iy-xW9t1-OQWhb0hNxu...


When I run it, I get "Your session crashed for an unknown reason."


Just click runtime -> run all again. There’s a weirdness where Python’s loader gets confused and the most effective fix is to crash the interpreter


Looks great, but I use Linux and the README is fairly Windows-centric without warning. It'd be nice if there were clearly delineated sections for Windows vs *nix.

There's a (very ironically named) "Manual installation" section which might seem to be the answer for Linux, but it's not immediately obvious which of the preceding sections are for Linux or Windows without some careful reading.


I’m waiting for someone to wrap this up into a desktop app that I can install and run on my Mac.


I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.

Stable Diffusion is built on PyTorch. PyTorch has mainly been designed to work with Nvidia cards. However, PyTorch added support for something called ROCm about a year ago, which adds compatibility with newer AMD cards.

Unfortunately ROCm doesn't support slightly older AMD cards in conjunction with Intel processors.

So my 32gb pretty powerful 2020 16in MacBook Pro isn't capable of running Stable Diffusion.

Any native app will likely have to rely on a remote cloud gpu. And boy, those are fucking expensive. Been researching what I need to stand up a service the last few days and it isn't cost friendly.


> I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.

And not just any old M1 Mac. Last week I got it running on my 2021 8GB M1 MacBook Air and it's slow. Images at 512x512 with 10 steps take between 7 and 10 minutes to generate.

It's the only thing I do that hits performance limitations on the 8GB machine so there's no regrets on that score, but with the way this stuff is progressing 16GB+ is a realistic minimum for comfortable use.


FWIW I'm on a 2021 16GB M1 MacBook Pro and it takes about 7min for me as well.

I've just been following the steps here with default settings: https://replicate.com/blog/run-stable-diffusion-on-m1-mac, but maybe there's a better way to run it at this point?


That's what I initially followed, too. But there does seem to be a better way - I've just installed CHARL-E [0] (mentioned elsewhere in this thread) and it was trivial to set up. Literally download the dmg, drag into Applications, and run.

It's an electron app and you can either download it with the weights, or without and add them separately.

Using that it just took slightly over 5 minutes on my 8GB so it's a little bit quicker for me. Maybe the code has improved since I cloned stuff over a week ago, or maybe it's just different system resources when it is run. Either way, it looks like the easy way I've been waiting for.

[0] https://www.charl-e.com

Edit: It is however missing the CFG setting.


I've been running the Intel CPU version [0] for a while now on a 2013 MacMini. Works fine; it takes several minutes per image but I can live with that.

[0] https://news.ycombinator.com/item?id=32642255


There is work on a CoreML version which may play nicer with older Macs w/sufficiently beefy dGPUs.

https://github.com/huggingface/diffusers/issues/443


Will the CoreML version run faster than https://replicate.com/blog/run-stable-diffusion-on-m1-mac on an M1 Mac?


> And boy, those are fucking expensive.

Unless you want to train the model, Lambda Labs is somewhat cheap:

https://lambdalabs.com/service/gpu-cloud


Apple's MPS drivers support AMD GPUs on macOS.


Just heard about MPS in another thread.


I’ve been working on a queue-centric desktop app GUI for SD: https://twitter.com/westoncb/status/1568114946235580418?s=46...

I plan to wrap things up and put out the source this weekend.


https://www.charl-e.com/

There are a few bugs to iron out before it's ready for prime time. For now, create the folder `~/Desktop/charl-e/samples/` manually before you run it.


Great stuff, and works on an 8GB M1 Air taking between 5 and 10 minutes for between 5 and 15 steps. As a suggestion, perhaps add the option for setting the CFG too (I know, it's open source etc, but it's just a suggestion).


Funny, it probably does a whole lotta things, but it can't create the `~/Desktop/charl-e/samples/` directory? That seems like it should be relatively trivial...


Same! I wrote a public web app so that I could access the model from my phone [0]. This is how I found Replicate [1]. Their SD model is very cheap to use. While we wait for a native Mac app, I recommend accessing the model straight from their web UI.

[0] https://www.drawanything.app/

[1] https://replicate.com/


People recently figured out how to export stable diffusion to onnx so it’ll be exciting to see some actual web UIs for it soon (via quantized models and tfjs/onnxruntime for web)


Very cool! Can you link to where this is taking place?

A commenter mentioned today it might be possible to pre-download the model and load it into the browser from the local filesystem rather than include such a gigantic blob as an accompanying dependency, fighting different caching RFC's, security/usage restrictions, and anything else that might inadvertently trigger a re-download.

https://news.ycombinator.com/item?id=32777909#32779093


Support for ONNX export was just added to diffusers, but no runtime logic for scheduling yet.

https://github.com/huggingface/diffusers


Nice-to-know tricks for /sd-webui/:

- activate advanced: create prompt matrix and use

@a painting of a (forest|desert|swamp|island|plains) painted by (claude monet|greg rutkowski|thomas kinkade)

- add different relative weights for words in a prompt:

watercolor :0.5 painting :0.2 by picasso :0.3

- Generate much larger images with your limited VRAM by using optimized versions of attention.py and model.py (a rough diffusers equivalent is sketched after this list)

https://github.com/sd-webui/stable-diffusion-webui/discussio...

- Generate "Loab the AI haunting woman" if you can (Try using textual inversion with negatively weighted prompts)

https://www.cnet.com/science/what-is-loab-the-haunting-ai-ar...
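
On the limited-VRAM trick above: if you're using the diffusers library rather than the webui, the rough equivalent (method name as of diffusers ~0.3) is attention slicing:

  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
  pipe.enable_attention_slicing()  # trades a little speed for a much lower VRAM peak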


- add GFPGAN to fix distorted faces

https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...

- add RealESRGAN for better upscaling

https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...

- add LDSR for crazy good upscaling (for 10x the processing time)

https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...


It seems Midjourney generates better results than SD or Dall-E.

While we're at it, what's with the "hyper resolution", "4K, detailed" adjectives that are thrown around left and right?


Those are prompt engineering keywords. SD is way more reliant on tinkering with the prompt than midjourney

https://moritz.pm/posts/parameters


MidJourney needs a lot of prompt engineering too. And Dall-E also. If you look at the prompt as an opportunity to describe what you want to see, the results are often disappointing. It works better to think backwards about how the model was trained, and what sorts of web caption words it likely saw in training examples that used the sorts of features you’re hoping it will generate. This is more of a process of learning to ask the model to produce things it’s able to produce, using its special image language.


The metadata and file names of the images in the source data set are also inputs for the model training. These keywords are common tags across images that have these characteristics, so in the same way it knows what a unicorn looks like, it also knows what a 4k unicorn looks like compared to a hyper rez unicorn.


Midjourney uses SD under the hood (you can see it in their license), but they augment the model in various ways.


The results in Midjourney are significantly better than SD's. I find it much easier to get to a good result in MJ and I've been trying to understand why. Any more insight you could share?


Good engineering. Midjourney likely has a lot going on under the hood before your prompt actually gets to Stable Diffusion. As an example, you can check out this research paper [0], which seeks to add prompt chaining to GPT-3 so you can "correct" its outputs before they reach the user. There's also no rule that says you can only make one call to SD; MJ likely bounces a picture through a pipeline they've tuned to ensure your generated image looks more reasonable.

[0]: https://arxiv.org/abs/2110.01691


Midjourney takes their base models and does further training/guidance on them to bring out intentional aesthetic qualities. One of their main goals is to ensure that their "default" style is beautiful no matter how simple the user's prompt is.


Opinionated prompt suffixes injected in the background (varying based on user input), plus post-processing pipelines.


Midjourney is doing "secret sauce" post-processing to enhance the image returned from the model. SD just gives you back what the model spits out. That's how I understand it at least


I've been having a lot of fun with Stable Diffusion and Midjourney.

One thing that is very powerful with Stable Diffusion is using textual inversion ( https://textual-inversion.github.io/ ) - you can add additional input samples to further extend the possibilities beyond what is included in the original model.


Can I run (train?) this textual inversion using the same consumer GPUs that work with stable-diffusion? Or does it require a much beefier machine


You can, though you might run into memory limitations running it on a GPU. There can be tuning done to lower the VRAM utilization, but I have been lucky enough to not need this - I do some CG work and ran into VRAM limitations there, so I'm on a 3090 with 24GB.

You can always run it on a CPU and utilize your RAM instead if needed, though the training might extend to 24+ hours that way.

Edit: Here's an example of someone successfully using textual inversion - https://www.reddit.com/r/StableDiffusion/comments/wz88lg/i_g...


Thanks


Another free offline and easy to use model with a GUI can be downloaded from here: https://grisk.itch.io/stable-diffusion-gui

For some genuinely incredible results try this pattern for instruction:

Portrait of {Name of some type of identity such as "Faerie Princess" or "Dragon Queen"} {Name of a celebrity such as "Scarlett Johansson"}, beautiful face, symmetrical face, tone mapped, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and Greg Rutkowski and Alphonse Mucha and Boris Vallejo and Johannes Voss and Aleksi Briclot and Michael Komarck.

Run several iterations of the same query as some results will have anomalies.


I have a 6gb 1660ti, barely holding on. Is a new 12gb card good enough for now, or should I go even higher to be safe for a few years of sd innovation?


I'm using it with a 2070 (4 year old card with 8gb vram) and it takes about 5 seconds for a 512x512 image. It's been plenty fast to have some fun, but I think I'd want faster if it was part of a professional work flow.


What settings? That seems faster than expected.


It was the defaults for the webui I used. Faster than I expected too, but the results were all legit.

Edit: Got home and was able to double check. It's actually a solid 10 seconds per image with the following settings: seed:466520488 width:512 height:512 steps:50 cfg_scale:7.5 sampler:k_lms. Still quick enough for some fun, but could be annoying if you need to do multiple iterations a minute.


Two minutes with my 1060, sadly.


I'm on my 2020 MacBook Air M1 ... a 512px image takes 2-3 minutes :(


The GeForce 4000 series is about to release and should make Stable Diffusion wayyyyy faster based on related H100 benchmarks posted today.


It sounds like there are forks that are able to work with <=8GB cards. And I'm not sure, but I think the weights are using f32, so switching to half precision might make it even easier to get this to work with less memory.

But yeah the next generation of models would probably capitalize on more memory somehow.


People have reported that this repo even works with 2gb cards if you run it with --lowvram and --opt-split-attention.


Yes, the amount of VRAM doesn't seem to be as much of a limitation anymore. However, processing power is still important.


How is M1/M2 support for SD? Is there a significant performance drop? Presumably you would be able to buy a 32GB M2 and be future proof because of the shared memory between CPU/GPU.


I recently switched from a CPU-only version to this repo release 1.13: https://github.com/lstein/stable-diffusion

The original txt2img and img2img scripts are a bit wonky and not all of the samplers work, but as long as you stick to dream.py and use a working sampler (I've had good luck with k_lms), it works great and runs way faster than the CPU version.

Works great on 32gb ram but I'm honestly tempted to sell this one and get a 64gb model once the m2 pros come around. This is capable of eating up all the ram you can throw at it to do multiple pictures simultaneously.


In my setup at least it runs essentially in CPU mode since there is no CUDA acceleration available and metal support is really messy right now. So while quite slow I don't run into memory issues at least. It runs much faster on my desktop GPU but that has more constraints (until I upgrade my personal 1080 to a 3090 one of these days).


There was a long thread last week. It’s honestly pretty good if you follow the instructions. 30-40 seconds/image.


Yeah, I followed the instructions on a M1 Macbook Pro (Monterey 12.5.1) and it worked without extra effort. 30-40 seconds per image. I have 32GB but image generation doesn’t even use half of it. The hard part has been to generate prompts that do what I want.


Regarding the opening image: if it can't correctly put the marks on dice, how can it put eyes, nose and mouth correctly on a human face?


> Regarding the opening image: if it can't correctly put the marks on dice, how can it put eyes, nose and mouth correctly on a human face?

It helps if you consider it all as effectively advanced compression. Everything the model can do is limited by its architecture, the number of parameters in the model, and the accuracy and size of the training data.

The underlying architecture is a transformer (e.g. GPT3) wired to a (denoising) diffusion model.

Current flaws with this approach:

- Transformers seem to approach a "bag-of-words" model, often ignoring the ordering of the words. Among other things, this means that text-to-image models are very bad at "binding attributes" [0]. This is why "a boy wearing a red shirt and a girl wearing a black jacket" may fail (putting the colors on the wrong items, for instance).

- Autoregressive transformers have no means to correct early mistakes.

- Training data is captioned images and the captions are likely noisy and under-specified. Every time it sees a face labeled as "face" - it tries to generate a face from the distribution of _all_ faces in the data. The same goes for the dice. If the dice are just labeled "dice", but don't have a description of how they landed - the model has to guess which angle you're referring to. As a sibling comment points out, this is exacerbated by the relative frequencies of examples of the data in the dataset.

[0] https://wikiless.org/wiki/Binding_(linguistics)


It can’t. :)

Well, I kid a bit. I’ve seen it produce some amazing results, but, generally, it has a hard time with that. Often faces end up looking blurry or having these creepy, dead white eyes. Hands likewise often end up malformed (seven fingers anyone?) and twisty. But, it seems to have a much easier time generating passable faces in close ups with the right key words. Especially if you give it an input image that already has a clear one. It also seems to have an easier time doing faces it already knows like a celebrity, presumably because it’s using a strong existing influence instead of inventing/hallucinating it.

Supposedly this is improved in their new 1.5 version which is in beta. The software is so compelling that I suspect this will be improved quite quickly. Also, I think either way workarounds will emerge, either by composing with other networks/software (some UIs have GANs for face correction) or the old fashioned way by photoshopping over the blemishes.


It’s worth noting you also get MUCH better faces and hands if you’re willing to run it for more steps (100-150). It takes a lot longer to run it at higher step counts so a lot of people don’t do it.


Presumably the number of faces in the training set far exceeds the number of dice by more than a few orders of magnitude.


In one of the other posts I noticed this option:

> GFPGAN Face Correction: Automatically correct distorted faces with a built-in GFPGAN option, fixes them in less than half a second

So apparently there is still an issue with faces.


Yep, it gets faces mostly right, but as they say, the devil's in the details. Eyes in particular don't seem to have clearly delineated concentric circles for irises and pupils, instead they are often rendered as a "swirl".

Example image directly from Stable Diffusion:

https://i.imgur.com/XSk8fIv.png

And here is that image run through GFPGAN:

https://i.imgur.com/I53AGmh.png

Interesting to note how specialised GFPGAN is, as some of the other details (flowers, hair) seem to be worse in the processed image. I plan to finish this image by manually blending the best of both pictures.
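
For anyone wanting to script that GFPGAN pass outside the web UI, a rough sketch with the gfpgan package (weights path and exact signature may vary by version):

  import cv2
  from gfpgan import GFPGANer

  restorer = GFPGANer(model_path="GFPGANv1.3.pth", upscale=1)  # assumes the weights were downloaded locally
  img = cv2.imread("sd_output.png")
  # enhance() returns cropped faces, restored faces, and the full restored image
  _, _, restored = restorer.enhance(img, paste_back=True)
  cv2.imwrite("sd_output_gfpgan.png", restored)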


Ironically, that almost looks like the sort of fantasy image you'd find as GPU box art many years ago.


In my experience, Stable Diffusion is also pretty bad at human faces and hands.


How does copyright work with output images? If someone runs the model on their own hardware, do they "own" the images generated? If 2 people generate the same image using the same prompt/seed, who "owns" the image?


https://www.smithsonianmag.com/smart-news/us-copyright-offic...

In the US, AI generated art cannot be copyrighted.

Edit:

Some additional details.

The US also denied a copyright for one where the creator listed themselves, with the AI just being a co-creator.

https://www.reddit.com/r/COPYRIGHT/comments/vshypc/the_us_co... (Original article is paywalled, reddit post contains the relevant bits)

In particular: >“Even though you argue that there is some human creative input present in the work that is distinct from RAGHAV’s contribution, this human authorship cannot be distinguished or separated from the final work produced by the computer program,” the office stated.

The US does seem to be a bit of an outlier here. The above work was granted copyright in Canada and India.

In the EU, AI generated artwork is likely copyrightable: https://link.springer.com/article/10.1007/s40319-021-01115-0

The same for the UK: https://www.kilburnstrode.com/knowledge/ai/ai-musings/respon...

Edit2: I'm not a lawyer, this isn't legal advice, go contact one if you actually need legal advice here.


That’s one case, not a ruling about all AI generated art. It won’t be the same for every image involving AI in some way. What if you use AI to fill in a portion of an image, as with Adobe’s content aware fill? What if you use a series of SD steps but with a human selecting outputs and feeding them back in as inputs to get something else the AI could not have come up with on its own? The copyright conversation is only just beginning.


>That’s one case, not a ruling about all AI generated art

"Because copyright law as codified in the 1976 Act requires human authorship, the Work cannot be registered."

The actual ruling (and a similar USPTO discussions) are about AI generated art and talk extensively about it in the broad case. The stance of these organizations is that AI generated art is not copyrightable. I don't disagree that the line is blurred when you discuss content aware fill, where the AI is working on a portion of it, but the current use of SD, even img2img and multiple prompts, etc., quite clearly falls outside of human authorship as recognized by the US Copyright and Patent offices.

https://www.copyright.gov/rulings-filings/review-board/docs/... https://www.uspto.gov/sites/default/files/documents/USPTO_AI...

Might this change in the future? Possibly. But as it stands today, I would not make any plans that assume you can secure the copyright (in the US) to anything made with SD.

Edit: Going through and noting that I'm not a lawyer and this isn't legal advice, don't listen to some random on the internet for legal advice, get a lawyer if you need it.


> quite clearly falls outside of human authorship as recognized by the US Copyright and Patent offices.

I think these are answering a slightly different question, as they are asking if the AI itself can hold the copyright on the output. A bit like if someone tried to copyright an image and assign “Photoshop” as the author.

The question above is maybe closer to asking if the person using an ML model can get copyright on the output, in that case there is a person trying to own the copyright, so I suspect it would not be rejected so easily.


>as they are asking if the AI itself can hold the copyright on the output.

Who is?

The original question I replied to: >If someone runs the model on their own hardware, do they "own" the images generated?

This seems to be straightforward - Thaler tried to receive the copyright for the artwork generated by his Creativity Machine. He was denied, because the copyright office does not believe that a neural network generated image has human authorship.

From the Copyright office paper: "he [Thaler] was “seeking to register this computer-generated work as a work-for-hire to the owner of the Creativity Machine.”"

>A bit like if someone tried to copyright an image and assign “Photoshop” as the author.

This is also clearly outside of the scope of copyrightable work per the reasoning given by the copyright office.

Both questions are thoroughly answered at this moment unless Thaler wins his appeal.

Edit: Going through and noting that I'm not a lawyer and this isn't legal advice, don't listen to some random on the internet for legal advice, get a lawyer if you need it.


> > as they are asking if the AI itself can hold the copyright on the output.

> Who is?

Thaler is. I’ve only read the intro sections of the documents you linked to, so I may have missed something more fundamental later, but the key points seem to be:

> The author of the Work was identified as the “Creativity Machine,” ... the Work “was autonomously created by a computer algorithm running on a machine”

and:

> Thaler must either provide evidence that the Work is the product of human authorship or convince the Office to depart from a century of copyright jurisprudence. He has done neither.

So in this case they are asking if the AI can be the author.

Whereas the question in this thread was:

> If someone runs the model on their own hardware, do they "own" the images generated?

In that case, a human is providing a prompt to the model (providing creative input), and asking if they themselves count as the author (a human rather than a neural net), so it seems like a significantly different case.


Thaler specifically asks for himself to be given the copyright assignment in the filing, claiming that the AI is essentially creating it in a work-for-hire. He does not ask for the Creativity Machine to be assigned the copyright.

>In that case, a human is providing a prompt to the model (providing creative input), and asking if they themselves count as the author (a human rather than a neural net), so it seems like a significantly different case.

I don't know that I specifically agree with this, but this is probably due to me having read additional articles on similar filings, including one where someone took a photograph, applied a style transfer AI to it, and then tried to copyright the resulting image, and was denied, because the copyright office found that there was not evidence that the work was a product of human authorship.

Andres Guadamuz (a lawyer specializing in IP law, senior lecturer at Sussex university, and a proponent of AI generated work being copyrightable) discusses a lot of this in https://www.technollama.co.uk/dall%c2%b7e-goes-commercial-bu... - but the most relevant part to this discussion is "For the most part, the legal consensus appears to be that the images do not have any copyright whatsoever, and that they’re all in the public domain."

The user experience for DALL-E, StableDiffusion, Midjourney, etc. are all essentially the same - craft a prompt, fine-tune it, get artwork out, so his discussion should be broadly applicable to all of these similar tools.


Thanks for the link, interesting reading. I can totally appreciate the angle that some generations may be too trivial to be worthy of protection.

I happen to be in the UK, and this happens to match my expectations, but it does strongly imply more regional variation than I’d have guessed:

> The situation may be different in the UK, where copyright law allows copyright on a computer-generated work, the author of which is the person who made the arrangements necessary for the work to be created. This, in my opinion, is the user, as we come up with the prompt and initiate the creation of the specific work. I think that there may be a good case to be made that I own the images I create in the UK.


Interesting, so if I run SD on a server in the UK, would the images technically be generated in the UK?


This seems odd, there is still a human authoring the images through the use of prompts etc. How does this differ from using a paint brush?


https://www.reddit.com/r/COPYRIGHT/comments/vshypc/the_us_co... talks about a (paywalled) article that discusses this problem.

This was a Style Transfer AI - it takes a source image and recreates it in the style of a painter.

In this case, the person both took the photo that the style was transferred to, and selected the style and a variety of variables. The US Copyright office still felt that his contribution was not distinguishable from the work that the AI did.

I'll note that this is very US specific - there are a lot of counter-examples of other countries allowing for the copyright of work like this, including the EU, UK, Canada, India, etc.


What if I create a fully automated image site where people can purchase images. All the images are generated by SD based on keywords I scrape from competitor websites. Would be quite easy to create a website like that.


Is it any different than Photoshop content aware fill? Or using a camera?

Nobody would ever think about Adobe or Nikon having copyright claims over your pictures. For me it's just a tool, the artistic part is providing a good description/base image, refining and choosing the best output.

Anyway, I'm not a lawyer and we probably live in different countries, so it'll be interesting to wait for the first lawsuit.


> Nobody would ever think about Adobe or Nikon having copyright claims over your pictures

But it is illegal to share pictures of the Eiffel Tower, for example.

People do it, but they shouldn't.

If I put a picture of the Eiffel Tower at night in a book or any other kind of commercial product, I have to pay to use it. It doesn't matter that it's there for my eyes to see.

The question is: are images generated by a hyper-accelerated learning machine using copyrighted material without the authors' consent legal?

I think they shouldn't be and the data included in the training should be free or licensed.


That's actually just French copyright law; it has never been challenged, I believe, and applies only to the night lighting.


What if Stable Diffusion generates those images ? :-)


If someone can dockerize this, please reply with a link!


The main one I've been testing (https://github.com/sd-webui/stable-diffusion-webui) has a Docker Compose file.

Grab that, install Docker, install Nvidia's Docker integration, copy the example Docker-env file, and docker-compose up is all you need.

Edit: here's a gist with exact steps I used: https://gist.github.com/geerlingguy/384ed4aba35e3118f2a0f358...


This is the dockerized version of this repo: https://github.com/AbdBarho/stable-diffusion-webui-docker


A sigh of relief, as I thought this would generate React UI code from a plain-text description instead of pictures.

First they came for illustrators, then they came for UI designers.



Yes, what I meant was mixing Stable Diffusion into the UI styling without the fine atomic details. Say I want a dashboard with color themes from Cyberpunk, complete with graphics/art/logo. It would be in the "good enough" category if something could literally be told to produce a complete frontend.


heh yeah it won’t be that long until we have A Stable Diffusion for Web UI


Nit: a seekbar is terrible UI for selecting an exact large value, like resolution. I don't know why it is accepted.





