"Type what you want to see and get a drawing of it" is very powerful, even if its scope is still limited.
1. People buy photography, even stock photography for ads or business use, with some intention of affiliating with the photographer or artist, or with some notion of an artsy style or aesthetic. Autogenerated art starts right off the bat at a disadvantage for being “commodity” in nature, even compared to the repetitive inventory of sites like Shutterstock. Maybe you can get past this in certain niche areas where the real photos are already exceedingly commodity, like backgrounds, office photos, and landscapes. But even then, the status of an artist counts for a lot.
2. It’s not actually that cheap to operate generative image models at scale. You have to ensure that pre-generated content is of sufficient quality and covers sufficient subject-matter, compositional, and aesthetic variety. If content is generated on the fly, you’ll be dealing with pretty high throughput on a very resource-intensive model.
3. Competition can replicate your image model pretty easily, so your differentiator comes back to branding and a sense of “not commodity” quality, as well as all accompanying services and support, which is where all your operating costs come from anyway.
I am sure generative inventory will become a bigger trend in stock photography, but I doubt it will be much of a differentiator. If you run a stock photography business with more than a few million images already, you would be better served building ML solutions for search, discovery, keyword or caption annotation, abusive-content detection, automated aesthetic enhancement, or assisted editing tools like style transfer. There won’t be a “holy grail” of generated inventory. Most customers just won’t care.
It’s a good example of how impressive, exotic ML solutions can seem like they surely must have consumer applications, yet where they interact with business concerns, it just doesn’t matter. Monetizing ML solutions is really hard - much harder than creating the ML solutions in the first place.
What I'd like to see from generative models is the generation of diagrams, geometry figures, mind maps, and data and algorithm drawings based on text inputs. I want a super-boosted imagination power in a box, a tool for modeling problems.
# Wishful pseudocode: train a generative model on a folder of photos
from photos import all_photos

model = train(all_photos)
a = model.generate()
Anybody have any leads..?
It's quite important to emphasise the dichotomy between the two current approaches to image synthesis today: 1. Explicit learning of the distribution - VAE- or autoregressive-based techniques. 2. Implicit distribution learning - GAN-based models. The ways these two model the distribution are fundamentally different.
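For concreteness, the two families optimise fundamentally different objectives. These are the standard textbook formulations, not tied to any particular model in this thread:

% Explicit / likelihood-based: maximise the model density on the data
\max_\theta \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_\theta(x)\right]

% Implicit / adversarial: the GAN minimax game - the generator's
% density p_G is never written down or evaluated
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\big(1 - D(G(z))\big)\right]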
Fundamentally, GANs present huge drawbacks when it comes to actual inference during synthesis. There have been dozens of models with workarounds, but most of these present new challenges of their own, with instability and mode collapse being the primary ones.
VQVAE2, as the most advanced VAE-based technique, has eliminated major drawbacks of both VAEs and GANs and has produced phenomenal quality.
However, the main challenge in the area is not synthesising just any kind of image - VQVAE2 is doing that very well already. Where none of the current techniques win today is multi-object image synthesis. That requires a new paradigm in architecture and distribution learning.
'Explicit' models (I think this term is nonstandard) parameterize the density directly and fit the parameters via maximum likelihood. This allows one, in theory, to both directly evaluate the density and sample from the learned distribution. VAEs (which only give a lower bound on the density), autoregressive models, and normalizing flows all fall under this category.
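To make the VAE caveat concrete, here is the standard evidence lower bound (ELBO); VAEs maximize the right-hand side, which equals \log p_\theta(x) only when the approximate posterior q_\phi matches the true posterior:

\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)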
Note that while it is theoretically possible for 'explicit' models to go in both directions (sample and evaluate), one direction may be much more efficient than the other for certain models. E.g., for autoregressive models, you can read the first two pages of  for a good explanation of why.
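A minimal sketch of that asymmetry for autoregressive models, in case code makes it clearer. This is illustrative only: the toy logistic conditional stands in for any learned p(x_i | x_<i), and the names are made up (only numpy is real).

import numpy as np

# Toy autoregressive model over binary sequences:
# p(x_i = 1 | x_<i) = sigmoid(b + sum_j w_j * x_j)
def conditional(prefix, w, b):
    z = b + sum(wj * xj for wj, xj in zip(w, prefix))
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x, w, b):
    # Evaluation: each term conditions on the *observed* prefix x[:i],
    # so all terms are independent of one another and can be computed
    # in parallel (one "teacher-forced" pass in a neural model).
    ll = 0.0
    for i, xi in enumerate(x):
        p = conditional(x[:i], w, b)
        ll += np.log(p if xi else 1.0 - p)
    return ll

def sample(n, w, b, rng):
    # Sampling: step i conditions on the values *sampled* at steps
    # 0..i-1, so the n steps are inherently sequential.
    x = []
    for _ in range(n):
        x.append(int(rng.random() < conditional(x, w, b)))
    return x

w = np.full(8, 0.1)
x = sample(8, w, 0.0, np.random.default_rng(0))
print(x, log_likelihood(x, w, 0.0))

For n tokens, evaluation is one parallelizable pass while sampling is n dependent passes, which is why autoregressive likelihood evaluation is cheap but generation is slow.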
As shown in a figure in the VQGAN section, VQGAN offers superior quality to VQVAE2 for a given compute budget, and given that VQGAN's generator is based on an architecture similar to VQVAE-1/2 (like DALL-E), it does not suffer from the mode collapse or instability you mentioned.
Edit: someone else pointed this out hours ago and provided a much more detailed answer.
The reason why VQGAN (taming transformers) has good quality (possibly the best, as you said) is precisely that it uses an idea from GANs (not the quantized VAE). So this model is not really a VAE; it's a combination of VAE and GAN, just like the DC-VAE model that was also featured in the post.
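For reference, here's the VQGAN training objective in simplified form, as I understand it from the taming-transformers paper (the perceptual loss and the adaptive choice of \lambda are omitted; sg is the stop-gradient operator, E the encoder, z_q the quantized code):

\mathcal{L} = \underbrace{\|x - \hat{x}\|^2 + \|\mathrm{sg}[E(x)] - z_q\|^2 + \beta\,\|\mathrm{sg}[z_q] - E(x)\|^2}_{\text{VQVAE-style terms}} + \lambda\,\mathcal{L}_{\mathrm{GAN}}

The first three terms are exactly the VQVAE recipe; the adversarial term \mathcal{L}_{\mathrm{GAN}} is the GAN idea bolted on top, which is the point above.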
If you take a look at the VQGAN section, you can see a comparison between DALL-E and VQGAN; the former uses substantially more resources and no GAN techniques. The latter shows much better quality, which shows that the GAN component really does offer much better quality.