
Correct. Most image models to date have used very small text models, which are unable to spell many words (https://arxiv.org/abs/2108.11193) - if you cannot spell a word letter by letter because you don't know what the letters are, how are you going to generate a pixel image of each of those letters...? (It is not as if letters were intrinsically hard: GANs were generating letters easily ~7 years ago.) That inability to spell comes from BPEs and other non-character-based tokenizations, which hide the letters inside opaque multi-character tokens.
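
To make the tokenization point concrete, here is a toy illustration in Python using OpenAI's tiktoken BPE library (the specific vocabulary and word are arbitrary choices of mine, not anything these image models literally run):

    import tiktoken

    # Any BPE vocabulary shows the same effect; cl100k_base is a recent GPT one.
    enc = tiktoken.get_encoding("cl100k_base")

    ids = enc.encode(" pharmaceutical")
    pieces = [enc.decode([i]) for i in ids]
    print(ids)     # one or a few opaque integer IDs
    print(pieces)  # whole-word or multi-letter chunks - never 'p', 'h', 'a', 'r', ...

The model only ever sees those integer IDs, so "what letters make up this word?" is something it has to memorize per token rather than read off the input.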

This was noted in the DALL-E 2 paper (https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai), and it can be experimentally established by swapping out even a very large LLM like PaLM for a humble, small, weak, but not badly-tokenized ByT5, and noting the near-instant solution of the 'problem' (https://arxiv.org/abs/2212.10562#google). Skip to the appendix of the second paper if you have any doubts about the difference that switching to ByT5 makes for spelling. The solution is just scaling up the LLMs (which is necessary to get better instruction-following and image quality in general, quite aside from spelling inside images) and eventually switching to character tokenization.* See, as always, https://gwern.net/gpt-3#bpes
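
For contrast, ByT5's tokenizer is byte-level, so every character arrives as its own token; a sketch assuming Hugging Face transformers and the public google/byt5-small checkpoint:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/byt5-small")

    ids = tok("pharmaceutical")["input_ids"]
    print(ids)                              # one ID per UTF-8 byte, plus the end-of-sequence token
    print(tok.convert_ids_to_tokens(ids))   # 'p', 'h', 'a', 'r', ... - every letter is visible

Same word, but the spelling is now explicit in the input rather than buried in token statistics - at the cost of much longer sequences, which is a big part of why character/byte tokenization has not been the default.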

(Hands and cats, however, are just genuinely difficult and require biting the bullet of scaling. And I wonder if it will take video supervision to truly solve them?)

* On a recent-news note, I suspect Claude-3 may have done something interesting with tokenization - possibly, but not necessarily, switching to character/byte encodings - and that this is part of why it confabulates in ways unusual for ChatGPT but is also a lot more pleasant to use.



