For reference, here's what you can get with a properly tweaked Stable Diffusion,...

ewjt · on June 20, 2023

Can you elaborate on “properly tweaked”? When I use one of the Stable Diffusion and AUTOMATIC1111 templates on runpod.io, the results are absolutely worthless.

This is using some of the popular prompts you can find on sites like prompthero that show amazing examples.

It’s been serious expectation vs. reality disappointment for me and so I just pay the MidJourney or DALL-E fees.

kouteiheika · on June 20, 2023

> Can you elaborate on “properly tweaked”?

In a nutshell:

1. Use a good checkpoint. Vanilla stable diffusion is relatively bad. There are plenty of good ones on civitai. Here's mine: https://civitai.com/models/94176

2. Use a good negative prompt with good textual inversions. (e.g. "ng_deepnegative_v1_75t", "verybadimagenegative_v1.3", etc.; you can download those from civitai too) Even if you have a good checkpoint this is essential to get good results.

3. Use a better sampling method instead of the default one. (e.g. I like to use "DPM++ SDE Karras")

There are more tricks to get even better output (e.g. controlnet is amazing), but these are the basics.

renewiltord · on June 20, 2023

Thank you. I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?

kouteiheika · on June 20, 2023

> I assume there's some community somewhere where people discuss this stuff. Do you know where that is? Or did you just learn this from disparate sources?

I learned this mostly by experimenting + browsing civitai and seeing what works + googling as I go + watching a few tutorials on YouTube (e.g. inpainting or controlnet can be tricky as there are a lot of options and it's not really obvious how/when to use them, so it's nice to actually watch someone else use them effectively).

I don't really have any particular place I could recommend to discuss this stuff, but I suppose /r/StableDiffusion/ on Reddit is decent.

bavell · on June 20, 2023

Pretty good reddit community, lots of (N/SFW) models and content on CivitAI. Took me a weekend to get setup and generating images. I've been getting good results on my AMD 6750XT with A1111 (vladmandic's fork).

Lerc · on June 20, 2023

What kind of(and how much) data did you use to train your checkpoint?

I'd like to have a go at making one myself targeted towards single objects (be it car,spaceship, dinner plate, apple, octopus, etc). Most checkpoints are very heavily leaning towards people and portraits.

AuryGlenz · on June 21, 2023

I’m not the OP but I’ve made some of my daughter, wife, dog, niece, etc.

People generally suggest 30+ images. I’ve found - at least with people - the more the better. My wife’s model is trained on ~80 images of her.

orbital-decay · on June 20, 2023

Are you using txt2img with the vanilla model? SD's actual value is in the large array of higher-order input methods and tooling; as a tradeoff, it requires more knowledge. Similarly to 3D CGI, it's a highly technical area. You don't just enter the prompt with it.

You can finetune it on your own material, or choose one of the hundreds of public finetuned models. You can guide it in a precise manner with a sketch or by extracting a pose from a photo using controlnets or any other method. You can influence the colors. You can explicitly separate prompt parts so the tokens don't leak into each other. You can use it as a photobashing tool with a plugin to popular image editing software. Things like ComfyUI enable extremely complicated pipelines as well. etc etc etc

nomand · on June 20, 2023

Is there a coherent resource (not a scattered 'just google it' series of guides from all over the place) that encapsulates some of the concepts and workflows you're describing? What would be the best learning site/resource for arriving at understanding how to integrate and manipulate SD with precision like that? Thanks

rwiggins · on June 21, 2023

I have found http://stable-diffusion-art.com to be an absolutely invaluable (and coherent) resource. It's highly ranked on Google for most "how to do X with stable diffusion" style searches, too.

kouteiheika · on June 20, 2023

> What would be the best learning site/resource for arriving at understanding how to integrate and manipulate SD with precision like that?

Honestly? Probably YouTube tutorials.

TeMPOraL · on June 20, 2023

Jaysus.

I'm going to sound like an entitled whiny old guy shouting at clouds, but - what the hell; with all the knowledge being either locked and churned on Discord, or released in form of YouTube videos with no transcript and extremely low content density - how is anyone with a job supposed to keep up with this? Or is that a new form of gatekeeping - if you can't afford to burn a lot of time and attention as if in some kind of Proof of Work scheme, you're not allowed to play with the newest toys?

I mean, Discord I can sort of get - chit-chatting and shitposting is easier than writing articles or maintaining wikis, and it kind of grows organically from there. But YouTube? Surely making a video takes 10-100x the effort and cost, compared to writing an article with some screenshots, while also being 10x more costly to consume (in terms of wasted time and strained attention). How does that even work?

bavell · on June 20, 2023

I've been playing with SD for a few months now and have only watched 20-30m of YT videos about it. There's only a few worth spending any time watching, and they're on specific workflows or techniques.

Best just to dive in if you're interested IMO. Otherwise you'll get lost in all the new jargon and ideas. Great place to start is the A1111 repo, lot of community resources available and batteries included.

orbital-decay · on June 20, 2023

How does anyone keep up with anything? It's a visual thing. A lot of people are learning drawing, modeling, animation etc in the exact same way - by watching YouTube (a bit) and experimenting (a lot).

TeMPOraL · on June 20, 2023

Picking images from generated sets is a visual thing. Tweaking ControlNet might be too (IDK, I've never got a chance to use it - partly because of what I'm whining about here). However, writing prompts, fine-tuning models, assembling pipelines, renting GPUs, figuring out which software to use for what, where to get the weights, etc. - none of this is visual. It's pretty much programming and devops.

I can't see how covering this on YouTube, instead of (vs. in addition to) writing text + some screenshots and diagrams, makes any kind of sense.

washadjeffmad · on June 23, 2023

This isn't for Stable Diffusion, but I wanted to provide a supplemental to my comment: https://kaiokendev.github.io/til

This is the level we're generally working at - first or second party to the authors of the research papers illustrating implementations of concepts, struggling with the Gradio interface, things going straight from commit to production.

It's way less frustrating to follow all of the authors in the citations of the projects you're interested in than wasting your attention sorting through blogspam, SEO, and YT trash just to find out they don't really understand anything, either.

TeMPOraL · on June 23, 2023

Thank you. I was reluctant to chase after and track first-party research directly, or work directly derived from it, as my limited prior experience told me it's not the most efficient thing unless I want to go into that field of research myself. You're changing my mind about this; from now, I'll try sticking close to source.

washadjeffmad · on June 21, 2023

There's a relatively thin layer between the papers and implementations, which is another way of saying this stuff is still for researchers and assumes a requisite level of background with them. It sounds like you'd benefit from seeking out the first party sources.

This is where video demonstrations come in handy. Since many concepts are novel, it's uncommon to find anyone who deeply understands them, but it's very easy to find people who have picked up on some tricks of the interfaces, which they're happy to click through. I think gradio/automatic1111 makes learning harder than it needs to be by hiding what it's doing behind its UI, while eg- comfyui has a higher initial learning curve but provides a more representational view of process and pipelines.

sorenjan · on June 20, 2023

Take a moment and go scroll through the examples at civitai.com. Does most of them strike you as something by people with jobs? Most of them are pretty juvenile, with pretty women and various anime girls.

sebzim4500 · on June 20, 2023

Are you under the impression that people with jobs don't like pretty women and anime girls?

sorenjan · on June 20, 2023

Of course not, but it looks like a teenage boy's room.

kaitai · on June 20, 2023

An operative word here is people.... the set "people with jobs" contains a far higher fraction of folks who like attractive men than is represented here....

dustypotato · on June 21, 2023

The difference being that youtube videos can make more money for the author. Anyway, it's all open source, so feel free to make a wiki

TeMPOraL · on June 21, 2023

I would if I could keep up with the videos :).

dustypotato · on June 21, 2023

I think it'd have been convenient for me as well if the AI tool that has access to YouTube videos would've been able to answer queries . But it takes 5 minutes to reply and I forgot it's name. It was on the front page recently

kouteiheika · on June 20, 2023

I mostly agree, but in this case it can be genuinely useful to actually see the process of someone using the tool effectively.

bavell · on June 20, 2023

ComfyUI is a nice complement to A1111, the node-based editor is great for prototyping and saving workflows.

og_kalu · on June 20, 2023

You're not going to get even close to Midjourney or even Bing quality on SD without finetuning. It's that simple. When you do finetune, it will be restricted to that aesthetic and you won't get the same prompt understanding or adherence.

For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality. There's a reason like 99% of ai art comic creators stick to Midjourney despite the control handicap.

orbital-decay · on June 20, 2023

Yet you are posting this in a thread where GP provided actual examples of the opposite. Look for another comment above/below, there are MJ-generated samples which are comparable but also less coherent than the result from a much smaller SD model. And in case of MJ hallucinations cannot be fixed. MJ is good but it isn't magic, it just provides quick results with little experience required; prompt understanding is still poor, and will stay poor until it's paired with a good LLM.

Neither of the existing models gives actually passable production-quality results, be it MJ or SD or whatever else. It will be quite some time until they get out of the uncanny valley.

> There's a reason like 99% of ai art comic creators stick to Midjourney

They aren't. MJ is mostly used by people without experience, think a journalist who needs a picture for an article. Which is great and it's what makes them good money.

As a matter of fact (I work with artists), for all the surface-visible hate AI art gets in the artist community, many actual artists are using it more and more to automate certain mundane parts of their job to save time, and this is not MJ or Dall-E.

Miraste · on June 20, 2023

There's a distinction to be made here. Everything that makes SD a powerful tool is the result of being open source. The actual models are significantly worse than Midjourney. If an MJ level model had the tooling SD does it would produce far better results.

bavell · on June 20, 2023

> If an MJ level model had the tooling SD does it would produce far better results

And vice versa, which is the exciting part to me - only a matter of time!

whywhywhywhy · on June 20, 2023

Midjourney output all has the same look to it.

If you’re ok with basic aesthetics it’ll work but if you want something a bit less cringe or that will stand out in marketing it won’t cut it.

Miraste · on June 20, 2023

It only has the same look if it's not given any style keywords. I've been impressed with the output diversity once it's told what to do. It can handle a wide range of art styles.

whywhywhywhy · on June 21, 2023

Then we need to give style keywords to the other networks too, and suddenly the gap shortens.

Default Midjourney is one thing and that’s mid…

og_kalu · on June 20, 2023

>Yet you are posting this in a thread where GP provided actual examples of the opposite.

Opposite of what ? OP posts results from a tuned model.

orbital-decay · on June 20, 2023

Opposite of this:

>For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality.

The results are comparable, but MJ in this comment https://news.ycombinator.com/item?id=36409043 hallucinates more (look at the roofs in the second picture). And it cannot be fixed, maybe except for an upscale making it a bit more coherent. Until MJ obtains better tooling (which it might in the next iteration), it won't be as powerful. I'm not even starting on complex compositions, which it simply cannot do.

>OP posts results from a tuned model.

Yes, which is the first step you should do with SD, as it's a much smaller and less capable model.

zirgs · on June 21, 2023

If course it's a tuned model. Why would anyone use stock SD these days?

chankstein38 · on June 20, 2023

I feel like people shouldn't talk in definitives if their message is just going to demonstrate they have no idea what they're talking about.

og_kalu · on June 20, 2023

I know what i'm talking about lol. I tuned a custom SD model that's downloaded thousands of times a month. I'm speaking from experience more than anything. Don't know why some SD users get so defensive.

SV_BubbleTime · on June 20, 2023

You load a model and have 6 sliders instead of one… it’s not exactly “fine tuning”.

If you want the power, it’s there. But nearly bone stock SD in auto1111 is going to get to any of these examples easily.

Show me the civitai equivalent for MJ or Dalle2. It doesn’t exist.

og_kalu · on June 20, 2023

>You load a model and have 6 sliders instead of one… it’s not exactly “fine tuning”.

Ok...? Read what i wrote carefully. Your 6 sliders won't produce better images than midjourney for your prompt on the base SD model.

zirgs · on June 21, 2023

Midjourney has a riduculously restrictive keyword filter. You should have mentioned that.

Also I see nothing wrong with using different models for different purposes.

capybara_2020 · on June 20, 2023

First off are you using a custom model or the default SD model? The default model is not the greatest. Have you tried controlnet?

But yes SD can be a bit of a pain to use. Think of it like this. SD = Linux, Midjourney = Windows/MacOS. SD is more powerful and user controllable but that also means it has a steeper learning curve.

senko · on June 20, 2023

I am sure you're right, but "if you know what you're doing" does a lot of heavy lifting here.

We could just as easily say "hosting your own email can be set up in a few minutes if you know what you're doing". I could do that, but I couldn't get local SD to generate comparable images if my life depended on it.

caseyf · on June 20, 2023

If you have an apple device, there is free GUI for Stable Diffusion called "Draw Things. It is nice and it just works. https://apps.apple.com/us/app/6444050820

screenshot of the options interface: https://stash.cass.xyz/drawthings-1687292611.png

ProfessorLayton · on June 20, 2023

Wow. It's both amazing and for some reason horrifying that this can run on an iPhone 11 (non-pro), and at reasonable speeds!

jfdi · on June 20, 2023

Nice! Would you mind sharing which stable diff you used / where you obtained from?

kouteiheika · on June 20, 2023

I'm using my own custom trained model.

Here, I've uploaded it to civitai: https://civitai.com/models/94176

There are plenty of other good models too though.

bavell · on June 20, 2023

Any tips or guides you followed on training your custom model? I've done a few LoRAs and TI but haven't gotten to my own models yet. Your results look great and I'd love a little insight into how you arrived there and what methods/tools you used.

kouteiheika · on June 21, 2023

I'm not an expert at this and there are probably better ways to do this/might not work for you/your mileage may vary, so please take this with a huge grain of salt, but roughly this worked for me:

1. Start with a good base model(s) from which to train from.

2. Have a lot of diverse images.

3. Ideally train for only one epoch. (Having a lot of images helps here.)

4. If you get bad results lower the learning rate and try again.

5. After training try to mix your finetuned model with the original one, in steps of 10%, generate X/Y plot of it, pick the best result.

6. Repeat this process as long as you're getting an improvement.

For training I mostly used scripts from here: https://github.com/bmaltais/kohya_ss

The main problem here is that essentially during inference you're using a bag of tricks to make the output better (e.g. good negative embeddings), but when training you don't. (And I'm not entirely sure how you'd actually integrate those into the training process; might be possible, but I didn't want to spend too much time on it.) So your fine tuning as-is might improve the output of the model when no tricks are used, but it can also regress it when the tricks are used. Which I why I did the "mix and pick the best one" step.

But, again, I'm not an expert at this and just did this for fun. Ultimately there might be better ways to do it.

bavell · on June 22, 2023

Great tips, thank you! It feels like I'm right behind you in terms of where I'm at so your input is very much appreciated.

3. Train for only 1 epoch - interesting, any known rationale here?

5. I just read somewhere else that someone got good results from mixing their custom model with the original (60/40 in their case) - good to hear some more anecdotes that this is pretty effective. Especially the further training after merging, sounds promising!

I've also been using kohya_ss for training LoRAs so great to hear it works for you for models as well. On your point about the inference tricks, definitely noted but I did notice that you can feed some params (# of samples, negative embeddings, etc) to the sample images generated during training (check the textarea placeholder text). Still not going to have all usual the tricks but it'll get you a little closer.

zirgs · on June 21, 2023

Make sure that you have enough vram. I can train loras with 8 gb easily, but when I tried to train a model - it gives me an oom error.

bavell · on June 22, 2023

Hopefully my 12GB is enough!

bluetidepro · on June 20, 2023

Do you have any good tutorial links to setup Stable Diffusion locally?

muhammadusman · on June 20, 2023

thanks for doing this, I would like to include these into the blog post as well. Can I use these and credit you for them? (let me know what you'd like linked)

kouteiheika · on June 20, 2023

Sure. No need to credit me.

muhammadusman · on June 20, 2023

thanks, updated the post with your results as well :)

tambourine_man · on June 21, 2023

Those are amazing, please consider writing a blog post of the steps you did to install and tweak Stable Diffusion to achieve these results. I'm sure many of us would love to read it.

troupo · on June 20, 2023

"Just" use a "properly tweaked" something.

bibanez · on June 21, 2023

You got incorporated into the article! Nice.