They trialled it as an explicitly optional model for a while a couple of years ago (or only a year? Time moves so fast; somewhere in the v2/v3 timeframe, around when SD came out). I'm fairly sure that's no longer the case.
DALL-E shares the same autoencoders as SD v1.x. It is probably similar to how Meta's Emu-class models work: they tweaked the architecture quite a bit, trained on their own dataset, and reused some components (or, in Emu's case, trained all the components from scratch but reused the same architecture).
I pay for both MJ and DALL-E (though OpenAI mostly gets my money for GPT) and don't find them to produce significantly better images than popular checkpoints on CivitAI. What I do find is that they are significantly easier to work with. (In fact, my experience across hundreds of DALL-E generations is that it's quite poor in quality. I'm in several IRC channels where it's the image generator of choice for some IRC bots, and I'm never particularly impressed with the visual quality.)
For MJ in particular, knowing that they at least used to use Stable Diffusion under the hood, it would not surprise me if the majority of the secret sauce is actually a middle layer that processes the prompt and rewrites it into one that works better with SD. Prompting SD to get output at MJ's quality level takes significantly more tokens, lots of refinement, heavy tweaking of negative prompts, etc. Also a stack of embeddings and LoRAs, though I would place those more in the category of finetuning, as you mentioned.
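To make the idea concrete, here's a toy sketch of what such a prompt-rewriting middle layer might look like. This is purely hypothetical: the quality tokens and the default negative prompt are illustrative conventions from the SD community, not anything MJ is known to actually do.

```python
# Hypothetical prompt "middleware": expands a terse user prompt into the kind
# of verbose, quality-token-laden prompt that Stable Diffusion tends to reward.
QUALITY_TOKENS = "highly detailed, sharp focus, dramatic lighting, 8k"
DEFAULT_NEGATIVE = "blurry, lowres, bad anatomy, extra fingers, watermark"

def rewrite_for_sd(user_prompt: str) -> dict:
    """Turn a short user prompt into a (prompt, negative_prompt) pair
    suitable for feeding to an SD pipeline."""
    prompt = f"{user_prompt.strip()}, {QUALITY_TOKENS}"
    return {"prompt": prompt, "negative_prompt": DEFAULT_NEGATIVE}

print(rewrite_for_sd("a castle at sunset"))
```

A real version would presumably use an LLM rather than fixed string templates, which is roughly what DALL-E 3 is known to do with its prompt rewriting.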
That looks very impressive, unless the demo is cherry-picked. It would be great if this could be implemented in a frontend like Fooocus: https://github.com/lllyasviel/Fooocus
What do you use it for? I haven't found a great use for it myself (outside of generating assets for landing pages / apps, where it's really really good). But I have seen endless subreddits / instagram pages dedicated to various forms of AI content, so it seems lots of people are using it for fun?
Nothing professional. I run a variety of tabletop RPGs for friends, so I mostly use it for making visual aids there. I've also got a large-format printer that I was no longer using for its original purpose, so I bought a few front-loading art frames that I generate art for and rotate through periodically.
Whose frames do you use? Do you like them? I print my photos to frame and hang, and wouldn't at all mind being able to rotate them more conveniently and inexpensively than dedicating a frame to each allows.
Perfectly suited to go alongside the style of frame I already have lots of, and very reasonably priced off the shelf for the 13x19 my printer tops out at. Thanks so much! It'll be easier to fill that one blank wall now.
I use ComfyUI/SD and MJ, and I have never seen anything on the level of what I get out of MJ. Nothing on CivitAI is impressive to me next to what I get from MJ.
Of course, art is so subjective that none of this has any real meaning. MJ routinely blows my mind, though, and it is very rare that something from SD does. The secret MJ sauce is obviously all the human feedback that has gone into the model at this point.
I think AI video will be a different story, though. That is when ComfyUI/SD will destroy MJ, because MJ simply isn't going to have a viable economic model given the amount of compute needed.
Largely some old channels from the 90s/00s that really only exist as vestiges of their former selves - not really related to their original purpose, just rooms for hanging out with friends made there back when they had a point besides being a group chat.
Midjourney has absolutely nothing to offer compared to proper finetunes. DALL-E does: it generalizes well (it can make objects interact properly, for example) and has great prompt adherence. But it can also be unpredictable as hell, because it rewrites the prompts. DALL-E's quality is meh: it has terrible artifacts on all pixel-sized details, hallucinations on small details, and limited resolution. ControlNets, finetuning/zero-shot reference transfer, and open tooling would have made a beast of a model out of it, but they aren't available.
I'm actually a person making technical decisions (art decisions in the past) in a VFX/art studio, and I'm talking about production use. No generative AI currently passes any reasonable production quality bar, but everyone is trying it for work that can't be done, or is cost-prohibitive, otherwise: animation, long series with style transfer, filler asset creation, etc. Anything that only has a text prompt can be discarded instantly. You have to be able to finetune it on your own material for consistency (of course I'm not talking about dubious third-party models), and you need higher-order guidance (e.g. ControlNets, especially custom ones) and many other things. In the hands of a skilled person, a trivial Krita/Photoshop plugin (Firefly, SD, SD realtime) blows anything MJ can offer out of the water, simply because it has all of that. You can't do much with text alone; it doesn't have enough semantic capacity to express artistic intent. I'm not even starting on animation.
In fact, anything that involves non-explicitly-guided one-shot generation of anything with light/shadow/colors/perspective is entirely out of the question with the current crop, because all models hallucinate hard and aren't controllable within a single generation. There are attempts at fixing perspective without explicit guidance, but it's going to be a long road, and it's not super relevant to how things are done anyway.
And for fine art, nothing beats a human painter; doing it by throwing prompts at AI mostly misses the point. I'm not even sure what you mean by fine art in this context, actually. Surely not generating artsy-looking images from a prompt for fun?
I think it'd be interesting to have a non-profit "model sharing" platform, where people can buy/sell compute. When you run someone's model, they get royalties on the compute you buy.
The net flow of knowledge about text-to-image generation from OpenAI has definitely been outward. The early open-source methods used CLIP, which OpenAI came up with. DALL-E (1) was also the first demonstration that we could do text-to-image at all. (There were some earlier papers, years before, that could give you a red splotch if you said "stop sign" or something.)
The GPL was intended for computer code that gets compiled to a binary form. You can share the binary, but you also have to share the code that the binary is compiled from. Pre-trained model weights might be thought of as analogous to compiled code, and the training data as analogous to program code, but they're not the same thing.
The model weights are shared openly, but the training data used to create these models isn't. This is at least partly because all these models, including OpenAI's, are trained on copyrighted data, so the copyright status of the models themselves is somewhat murky.
In the future we may see models that are 100% trained in the open, but foundational models are currently very expensive to train from scratch. Either prices would need to come down, or enthusiasts will need some way to share radically distributed GPU resources.
Tbh I think these models will largely be trained on synthetic datasets in the future. They are mostly trained on garbage now. We have been doing opt-outs on these, and it has been interesting to see the quality differential (or lack thereof), e.g. removing books3 from StableLM 3B Zephyr: https://stability.wandb.io/stability-llm/stable-lm/reports/S...
Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output vs. novel material from the comparatively very intelligent members of the human species? Would be interesting to see your take on this.
There are approaches to getting the right type of augmented and generated data to feed these models; check out the QDAIF paper we worked on, for example.
I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them then that’s a good datapoint.
What I mean is, it’s important to train a model with and without books3. That’s the only way to know whether it was books3 itself causing the issue, or some artifact of the training process.
One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, it won’t be able to give an answer unless the knowledge is there in some form. I’ve often wondered whether scraping the internet is enough rather than training on books directly.
But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, and then ask some people which they prefer.
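A minimal sketch of how such a side-by-side test could be scored: collect blind pairwise votes between the two models and compute the win rate for the with-books3 model. The vote labels and helper name here are invented for illustration.

```python
def preference_winrate(votes: list) -> float:
    """Given blind pairwise votes ('with' = model trained with books3,
    'without' = model trained without), return the win rate of 'with'."""
    if not votes:
        raise ValueError("no votes collected")
    wins = sum(1 for v in votes if v == "with")
    return wins / len(votes)

votes = ["with", "with", "without", "with", "without"]
print(preference_winrate(votes))  # 0.6
```

With only a handful of votes a 0.6 win rate is indistinguishable from noise, so a real test would also need enough comparisons for a significance check.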
It’s really tricky to get all of this right. But if there are more details on the pes2o ablations, I’d be curious to see them.
The main reason is probably Midjourney and OpenAI using their tech without any kind of contribution back. AI desperately needs a GPL equivalent…