I gotta be missing something here, because wasn't "teaching a three year old to paint" (where the three year old is DALL-E) the original objective in the first place? So if we've reduced the problem to that, it seems we're back where we started. What's the difference?
I meant to say that DALL-E 2's approach is closer to "teaching a three year old to paint" than the alternative methods. Instead of trying to maximize agreement with a text embedding like other methods do, DALL-E 2 first predicts an image embedding (very roughly analogous to envisioning what you're going to draw before you start laying down paint), and then a decoder turns that embedding into an image (very roughly analogous to "knowing how to paint"). This is in contrast to approaches that refine an image by repeatedly asking "does this look like the text prompt?" (roughly analogous to not really knowing how to paint, but having a critic who tells you whether you're getting warmer or colder).
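To make the two-stage idea concrete, here's a toy PyTorch sketch of the data flow. The prior and decoder below are just stand-in linear layers I made up for illustration (the real components are diffusion models, and the text embedding comes from a CLIP-style encoder), so read it as a picture of the pipeline, not DALL-E 2's actual code:

    # Toy sketch of DALL-E 2's two-stage inference (not OpenAI's actual code).
    # "prior" and "decoder" are stand-in linear layers; the real ones are
    # diffusion models, and the text embedding comes from a CLIP-style encoder.
    import torch
    import torch.nn as nn

    TEXT_DIM, IMG_EMB_DIM, IMG_PIXELS = 512, 512, 3 * 64 * 64

    prior = nn.Linear(TEXT_DIM, IMG_EMB_DIM)      # "envision what you're going to draw"
    decoder = nn.Linear(IMG_EMB_DIM, IMG_PIXELS)  # "know how to paint"

    text_embedding = torch.randn(1, TEXT_DIM)             # stand-in for an encoded prompt
    image_embedding = prior(text_embedding)               # stage 1: predict the image embedding
    image = decoder(image_embedding).view(1, 3, 64, 64)   # stage 2: decode the embedding to pixels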
Well, the original DALL-E also worked this way. The reason the open-source models use this kind of iterative search is that OpenAI didn't release DALL-E itself, only a separate model called CLIP, which they used to rank DALL-E's outputs by quality. It turns out CLIP can be adapted to produce images too if you use it to steer a GAN.
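Roughly, the CLIP-steering trick is the loop below. This is a minimal sketch assuming OpenAI's CLIP package is installed; the "generator" is a toy linear layer standing in for a real GAN (the open-source notebooks typically use VQGAN), the prompt is arbitrary, and CLIP's usual image preprocessing is skipped:

    # Minimal sketch of CLIP steering a generator (the "warmer or colder" loop).
    # Assumes the openai/CLIP package is installed; the generator is a toy
    # stand-in, not a real GAN, and CLIP's normal preprocessing is skipped.
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize(["a painting of a fox"]).to(device))

    latent = torch.randn(1, 256, device=device, requires_grad=True)
    to_pixels = torch.nn.Linear(256, 3 * 224 * 224).to(device)  # toy "GAN generator"
    optimizer = torch.optim.Adam([latent], lr=0.05)

    for step in range(100):
        image = torch.sigmoid(to_pixels(latent)).view(1, 3, 224, 224)
        image_features = model.encode_image(image)
        # The "critic": push the image's CLIP embedding toward the prompt's.
        loss = -torch.cosine_similarity(image_features, text_features).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()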
There are DALL-E-style models available now from other groups that you can use directly (DALL-E Mini or ruDALL-E), but their vocabulary is small and they can't do faces for privacy reasons.