
At the end of the day, unless it's opened up, Dall-E 2 will be seen as an evolutionary dead end for this tech and a misstep.

It's gone from potentially one of the most innovative companies on the horizon to a dead product, now that I can spin up equivalent tech on my own machine and hook it into my workflow and tools in an afternoon, all because the Stable Diffusion team released their model into the wild.




Yeah, until Stable Diffusion became available, I felt that Dall-E 2's stance on not opening it up was sorta reasonable. Mostly because "this is groundbreaking tech producing all these impressive results that cost a ton to build, and I bet the Stable Diffusion announcements are all just riding the hype, and it will disappoint in the end."

I have never eaten my words as fast as I did when Stable Diffusion was finally released. It's such a gamechanger running locally, it isn't even funny. All these parameters and samplers you can play with; use a simple one-click GUI, a CLI, a web front-end, or even hook it up to any existing code flow you have going. And it all just works well, with every week bringing new advancements (like img2img).
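For a sense of what "hook it up to any existing code flow" looks like, here's a minimal sketch using the Hugging Face diffusers library (the checkpoint ID, prompt, and filename are just illustrative placeholders):

```
# Minimal local text-to-image sketch with diffusers; model ID and filename are examples only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # any SD checkpoint you have locally or access to
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("an astronaut riding a horse, photograph").images[0]
image.save("out.png")
```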

I haven't heard anyone talk about Dall-E 2 in over a month. Like, at all. All while Stable Diffusion took over pretty much every single social circle related to image generation that I'm in.

Major props to the Stable Diffusion team. They had a very high bar to reach, and not only did they manage to reach it extremely fast, they blew way past it. Leading by example, in the face of all the "no no no, we gotta keep the model and everything closed, you shouldn't be able to run it locally, and also we hardcoded input filters because safety and we know better than you do" arguments, was extremely satisfying to watch.


Dall-e is kind of sad. It reminds me of Google Video, that thing before YT.


My stable diffusion output looks awful. I've been trying to recreate the xkcd about Joe Biden eating sandwiches, so I try something like "Joe Biden eating a sandwich in the oval office, 4k render, photograph" and I get nightmare fuel with pieces of bread attached to his head, faces that dissolve into random geometric shapes, toppings that melt into hands while a sandwich sits on a plate in front of him, etc.

I had high hopes based on posts from the SD subreddit, and I figured Biden would be well represented in the training data. Am I missing something?


SD isn't great at generating images for detailed, weird prompts (at least not compared to DALL-E2). If you're not great at prompt writing or just having bad luck, you can use img2img with a rough sketch of what you want.
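A rough sketch of that img2img flow with diffusers, assuming a hand-drawn "sketch.png" as the starting image (the filenames and the strength value are placeholders, not recommendations):

```
# img2img: start from a rough sketch instead of pure noise.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init = Image.open("sketch.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="Joe Biden eating a sandwich in the oval office, photograph",
    image=init,          # called init_image in some older diffusers versions
    strength=0.6,        # 0..1: how far the result may stray from the sketch
    guidance_scale=7.5,
).images[0]
result.save("img2img.png")
```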


Here's a less specific prompt:

"Portrait of Joe Biden in the oval office, 4k render"

First attempt: https://pasteboard.co/IYo5m6KeaqF4.png

OK, I'll grant you that all of the parts of a face are there and reasonably correct (except for the bottomless pits of darkness in his nose and mouth).

Second attempt: I end up with these weird artifacts in his head more than half of the time (3/5 of my generated images):

* https://pasteboard.co/xDFv9KD7Or4U.png

* https://pasteboard.co/j7PqHPgorZ9G.png

Am I holding it wrong somehow?


What are your guidance scale, number of iterations, and chosen sampler? Those would be very relevant to know; pretty much the most relevant things aside from the prompt itself.

Setting the guidance scale higher typically results in imagery getting trippier and more surreal, with more artifacts. So I feel like that's the main culprit for the artifacts.

I'm pretty curious to see how far we can get with this prompt, so I'll try playing with it later today and post the results and what I found in a reply to this comment.


I'm using the default settings on the webui, here are the parameters:

```
Portrait of Joe Biden in the oval office, 4k render
seed:1331361607
width:512
height:512
steps:50
cfg_scale:7.5
sampler:k_lms
```
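(For reference, those same settings map fairly directly onto a script if you'd rather not go through the webui; a sketch with diffusers, where mapping k_lms to LMSDiscreteScheduler is my assumption:)

```
# Reproduce the webui settings above: fixed seed, 512x512, 50 steps, cfg_scale 7.5, k_lms-style sampler.
import torch
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)  # roughly the webui's k_lms

generator = torch.Generator("cuda").manual_seed(1331361607)
image = pipe(
    "Portrait of Joe Biden in the oval office, 4k render",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("joe_50steps_cfg7.5.png")
```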


Thanks for providing the seed, because that lets me show you exactly how the parameters affect your specific image without generating a "random" different one every time.

Check out the exact same seed, prompt, and cfg_scale, but with the steps (aka iteration count) at 100 (50 generally feels way too low, even for the samplers that are kinda good at low iteration counts).

https://pasteboard.co/I6yXg5mZip6D.png

Obvious glitchiness in the face. Below is the same one, but with a k_euler_a sampler (I don't use k_lms, mostly k_euler_a or k_dpm_2_a) + 100 iterations.

https://pasteboard.co/xaTJiN6eVhm2.png

Less glitchiness, but Joe looks more caricature-like than real. And also, not quite like Joe. Let's try the same, but at 150 iterations with the CFG set to 10.

https://pasteboard.co/a9OigPXS9Ky1.png

Not much different in terms of realism, but the person now looks distinctly more like Joe. Let's up the iteration count to 200.

https://pasteboard.co/ey0ZzC110CrK.png

We got a bit closer to what we wanted. Faces are a difficult thing to get right, but I think we can figure it out. Overall, it feels a bit "wobbly". I've noticed it tends to be beneficial to decrease the CFG as you increase the iteration count, if you want photos to be more photorealistic. Let's set the CFG to 6 and keep the iterations at 200.

https://pasteboard.co/na1fH54LkqO2.png

I would say this looks pretty good, but I think we can do better. The important part is, imo, the prompt, and I think we can edit yours to get somewhat better results. Here is the result for "portrait of Joe Biden in oval office, dslr" with 200 iterations, CFG at 6, and the k_euler_a sampler.

https://pasteboard.co/noHbMgxhLU4s.png

That one was probably my favorite (or maybe it was the one before).

Overall, you can play with this almost infinitely. Adding different words to the prompt in different spots can yield pretty different results. And that's not even mentioning all the parameters one can tune.
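If you want to explore that space systematically rather than one run at a time, a small grid sweep over steps and CFG with a fixed seed (so only the knobs change between images) is one way to do it. A sketch, again with diffusers, where EulerAncestralDiscreteScheduler standing in for k_euler_a is my assumption:

```
# Sweep steps x cfg_scale with a fixed seed to compare settings side by side.
import itertools
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = "portrait of Joe Biden in oval office, dslr"
seed = 1331361607  # same seed for every run, so only steps/cfg vary

for steps, cfg in itertools.product([100, 150, 200], [6.0, 7.5, 10.0]):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, num_inference_steps=steps,
                 guidance_scale=cfg, generator=generator).images[0]
    image.save(f"joe_steps{steps}_cfg{cfg}.png")
```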


That was a cool, detailed breakdown.

How did you learn all that? I'm a noob at prompt engineering but would love to know how to get good at it in the least amount of time.

Any good resources to get up to speed?


You might try using image to image with the comic and previously mentioned prompt as inputs.


I wish the same happened with GPT-3


Casual users don't have a workflow to hook into, though. A website will be more convenient for them since there's nothing to install, and the hosted version probably runs faster than it would on whatever hardware they have.



