Hm, I wonder which CLIP model they'll use. A big part of what makes DALL-E 2 so good is the huge, unreleased CLIP model behind it. To train the diffusion prior, they may first need to replicate that CLIP model.
Isn't the VQ-VAE/dVAE generator approach in the DALL-E models quite a bit cheaper computationally than latent diffusion models?
My understanding is that diffusion models are quite a bit more expensive to sample from, but yield richer latent distributions and better images (for some definition of better).
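For a rough feel of how the sampling cost comparison works, here is a back-of-the-envelope sketch that just counts sequential network passes. Everything in it is an assumption for illustration: the function names are made up, the 32x32 token grid and the 250 denoising steps are placeholder values, and the per-pass cost is set equal for both models, which is almost certainly not true in practice. It is not a measurement of any real model.

```python
def sampling_cost(sequential_passes: int, cost_per_pass: float) -> float:
    """Total sampling cost: number of sequential network passes times cost per pass."""
    return sequential_passes * cost_per_pass


def cost_ratio(ar_tokens: int, diffusion_steps: int,
               ar_cost_per_pass: float = 1.0,
               diffusion_cost_per_pass: float = 1.0) -> float:
    """Ratio of diffusion sampling cost to autoregressive sampling cost.

    A value above 1 means diffusion is the more expensive sampler
    under the supplied (hypothetical) numbers.
    """
    ar = sampling_cost(ar_tokens, ar_cost_per_pass)
    diffusion = sampling_cost(diffusion_steps, diffusion_cost_per_pass)
    return diffusion / ar


# Placeholder numbers only (assumptions, not benchmarks):
ar_tokens = 32 * 32      # one pass per image token on an assumed 32x32 token grid
diffusion_steps = 250    # assumed number of denoising steps

ratio = cost_ratio(ar_tokens, diffusion_steps)
print(f"diffusion / autoregressive cost ratio: {ratio:.3f}")
```

The point is only that the answer depends on the step count, the token count, and the per-pass cost of each network. This sketch ignores things like key/value caching in the autoregressive model and the one-off cost of the VQ-VAE/dVAE decoder, so plug in your own numbers before drawing any conclusion.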