At the end of the day, unless it's opened up, DALL-E 2 will be seen as an evolutionary dead end for this tech and a misstep.
It's gone from potentially one of the most innovative companies on the horizon to a dead product, now that I can spin up equivalent tech on my own machine and hook it into my workflow and tools in an afternoon, all because the Stable Diffusion team released their model into the wild.
Yeah, until Stable Diffusion became available, I felt that the DALL-E 2 stance on not opening it up was sorta reasonable. Mostly because "this is groundbreaking tech producing all these impressive results that cost a ton to build, and I bet the Stable Diffusion announcements are all just riding the hype and it will disappoint in the end."
I have never eaten my words as fast as I did when Stable Diffusion finally released. It's such a gamechanger running locally that it isn't even funny. There are all these parameters and samplers to play with; you can use a simple one-click GUI, a CLI, or a web front-end, or even hook it up to any existing code flow you have going. And it all just works well, with every week bringing new advancements (like img2img).
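To give a sense of how little glue code "hooking it into your own flow" actually takes, here's a minimal sketch using the Hugging Face diffusers wrapper around the released weights (the model ID and defaults are my assumptions; adjust for your own setup and VRAM):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the released v1.4 weights once; fp16 keeps VRAM use modest on consumer GPUs.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# One call = one image; drop this into any existing script or tool.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("out.png")
```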
I haven't heard anyone talk about DALL-E 2 in over a month. Like, at all. Meanwhile, Stable Diffusion has overtaken pretty much every single image-generation-related social circle I'm in.
Major props to the Stable Diffusion team. They had a very high bar to reach, and not only did they manage to do it extremely fast, they blew way past it. Leading by example, in the face of all the "no no no, we gotta keep the model and everything closed, you shouldn't be able to run it locally, and also we hardcoded input filters because safety and we know better than you do" BS arguments, was extremely satisfying to watch.
My Stable Diffusion output looks awful. I've been trying to recreate the xkcd about Joe Biden eating sandwiches, so I try something like "Joe Biden eating a sandwich in the oval office, 4k render, photograph" and I get nightmare fuel: pieces of bread attached to his head, faces that dissolve into random geometric shapes, toppings that melt into hands while a sandwich sits on a plate in front of him, etc.
I had high hopes based on posts from the SD subreddit, and I figured Biden would be well represented in the training data. Am I missing something?
SD isn't great at generating images from detailed, weird prompts (at least not compared to DALL-E 2). If you're not great at prompt writing, or are just having bad luck, you can use img2img with a rough sketch of what you want; a minimal example is below.
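If it helps, this is roughly what img2img looks like with the diffusers pipeline; the file name, strength value, and model ID are placeholders, and the exact argument names can vary between diffusers versions:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Any rough sketch works; SD mostly picks up composition and colors from it.
init = Image.open("biden_sandwich_sketch.png").convert("RGB").resize((512, 512))

image = pipe(
    prompt="Joe Biden eating a sandwich in the oval office, photograph",
    image=init,           # start from the sketch instead of pure noise
    strength=0.6,         # 0 = keep the sketch as-is, 1 = ignore it entirely
    guidance_scale=7.5,
).images[0]
image.save("biden_img2img.png")
```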
What are your guidance scale, number of iterations, and chosen sampler? Those would be very relevant to know; pretty much the most relevant things aside from the prompt itself.
Setting the guidance scale higher typically results in imagery getting trippier and more surreal, with more artifacts. So I feel like that's the main culprit for the artifacts.
I am pretty curious to see how far we can get with this prompt. So I will try playing with it later today and post the results and what I found in a reply to this comment.
Thanks for providing the seed, because that lets me show you exactly how the parameters affect your specific image without generating a "random" different one every time.
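For anyone following along in code rather than a GUI, pinning the seed is the only trick here. A rough diffusers sketch (the seed value and model ID are just placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "Joe Biden eating a sandwich in the oval office, 4k render, photograph"

# Re-seeding the generator before each run pins the starting noise, so the
# only thing that changes between runs is whatever parameter you tweak.
generator = torch.Generator("cuda").manual_seed(1337)
image = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("baseline.png")
```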
Check out the exact same seed, prompt, and cfg_scale, but with the steps (aka iteration count) at 100 (50 in general feels way too low, even for the samplers that are decent with low iteration counts).
Obvious glitchiness in the face. Below is the same one, but with the k_euler_a sampler (I don't use k_lms, mostly k_euler_a or k_dpm_2_a) and 100 iterations.
Less glitchiness, but Joe looks more caricature-like than real. And also, not quite like Joe. Let's try the same, but at 150 iterations with the CFG set to 10.
We got a bit closer to what we wanted. Faces are a difficult thing to do, but I think we can figure it out. Overall, it feels a bit "wobbly". I've noticed that it tends to be beneficial to decrease the CFG as you increase the iteration count if you want photos to be more photorealistic. Let's set it to 6 and the iterations to 200.
I would say this looks pretty good, but I think we can do better. The important part, IMO, is the prompt, and I think we can edit yours to get somewhat better results. Here is the result for "portrait of Joe Biden in oval office, dslr" with 200 iterations, CFG at 6, and the k_euler_a sampler.
That one was probably my favorite (or maybe it was the one before).
Overall, you can play with this almost infinitely. Adding different words to the prompt in different spots can yield pretty different results. And that's not even mentioning all the parameters one can tune.
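If you'd rather explore that space systematically than one run at a time, a small grid sweep over the parameters is the easiest way. A rough sketch in diffusers (the (steps, CFG) pairs just mirror the ones tried above; as far as I know EulerAncestralDiscreteScheduler is the closest diffusers equivalent of k_euler_a, and the model ID and seed are assumptions):

```python
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Swap the default sampler for ancestral Euler (roughly k_euler_a).
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = "portrait of Joe Biden in oval office, dslr"
seed = 1337

# Same seed for every combo, so only (steps, CFG) changes between images.
for steps, cfg in [(100, 7.5), (150, 10.0), (200, 6.0)]:
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=cfg,
        generator=generator,
    ).images[0]
    image.save(f"biden_steps{steps}_cfg{cfg}.png")
```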
Casual users don't have a workflow to hook into, though. A website will be more convenient for them since there's nothing to install, and the web app probably runs faster than whatever hardware they're using.