Fidelity on the output isn't great, but the coherence (assuming the examples weren't massively cherry-picked) seems very good. Given the number of parameters, this should be able to run on end-user machines, and in theory it could be fine-tuned to produce better-looking output than Stable Diffusion et al.
What this model does more than anything else is demonstrate we're still in the early stages of generative models, and we can expect a lot of progress from architectural improvements over the next decade (in addition to the progress in compute and data that we're already counting on).
It'd be interesting to see some results where the training set has higher artistic quality (and how this model influences the "house style"). The output does not look great when compared to what other (trained) models deliver.
But the promise of a big efficiency gain will be an incentive for companies like Midjourney to give it a go with their data.
More amazement. I wonder where this field will end up. Cute animal and nature images are nice but have limited real-life use (I mean, we have to accept that visual media ends after everyone can be an artist). I wonder when we'll start interfacing language models with robotics to do some real-life work.
I don't think we have to accept the idea that "visual media ends after everyone can be an artist." On the contrary: such a scenario may well make visual media even more relevant than it is today. Already we communicate concepts and emotions with gifs, icons, memes...
This can go in any number of fantastical directions. But visual media both as a private/personal medium and salve as well as an enterprise-grade tool of mass entertainment and propaganda? Baby, we're just getting started!
> Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations;
Am I wrong or is that the same architecture as DALL-E 1?
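Not quite: DALL-E 1 also used discrete image tokens, but it decoded them autoregressively, one per forward pass, while Muse uses parallel masked decoding that fills in many tokens per pass. A toy step-count comparison (the grid size and unmasking fraction below are illustrative assumptions, not numbers from the paper):

```python
def autoregressive_steps(num_tokens):
    # DALL-E 1 style: one token generated per forward pass.
    return num_tokens

def parallel_masked_steps(num_tokens, per_step_fraction=0.25):
    # Muse-style: each pass predicts all masked tokens and keeps
    # a fraction of them, so the grid fills in over a few passes.
    steps, remaining = 0, num_tokens
    while remaining > 0:
        remaining -= max(1, int(remaining * per_step_fraction))
        steps += 1
    return steps

tokens = 16 * 16  # hypothetical 16x16 grid of image tokens
print(autoregressive_steps(tokens))   # 256 passes
print(parallel_masked_steps(tokens))  # an order of magnitude fewer
```

Pixel-space diffusion models like Imagen additionally pay for hundreds of denoising steps over full-resolution pixels, which is the other half of the efficiency claim in the quote.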
Most Google ML work is just papers; correct me if I'm wrong. Some models have made their way to Hugging Face, like T5, but I don't think any have a web interface.