Boorus [0] contain millions of images, manually labeled to a pretty high quality. Notably, diffusion models trained on booru datasets have had good success.
This is not the only example of well-curated image-tag pairs, especially in artistic circles. It's just that most of them are not CC-licensed.
Boorus use tags instead of captions, so a model trained on them is really limited. Moreover, Danbooru has only 5 million images, while other boorus such as Gelbooru and Sankaku are lower quality.
Tags are limited how exactly? Prompt crafting becomes a case of selecting the relevant tags, and the embedding space will still capture the dataset.
Danbooru is only one example of well-curated tagging, and if we ignore copyright there are far more. These examples just serve as evidence that refining poor labeling is not outside the realm of possibility, as you suggested.
A tag-based system would completely lack any kind of contextual information and it would not be possible to create any relationship between words; natural language is much more powerful.
For example, say an image is tagged kanna_kamui, kimono and torhu_(maiddragon). Who has the kimono? Kanna, Torhu, or both? There is no way to know from the tags, but with natural language it is possible to describe who is wearing what.
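To make the kimono example concrete, here is a minimal Python sketch. The scene dictionaries and the helper names `to_tags` and `to_caption` are made up for illustration, not part of any booru API. Two different scenes collapse to the same tag set, while even a crude caption keeps them apart.

```python
def to_tags(scene):
    """Flatten a scene into an unordered tag set, the way a booru-style label would."""
    tags = set()
    for character, clothing in scene.items():
        tags.add(character)
        tags.update(clothing)
    return frozenset(tags)


def to_caption(scene):
    """Build a crude caption that keeps track of who wears what."""
    parts = []
    for character, clothing in sorted(scene.items()):
        if clothing:
            parts.append(f"{character} wearing {', '.join(clothing)}")
        else:
            parts.append(character)
    return " and ".join(parts)


# Hypothetical scenes: the same characters, but a different one wears the kimono.
kanna_wears = {"kanna_kamui": ["kimono"], "torhu_(maiddragon)": []}
torhu_wears = {"kanna_kamui": [], "torhu_(maiddragon)": ["kimono"]}

same_tags = to_tags(kanna_wears) == to_tags(torhu_wears)
same_caption = to_caption(kanna_wears) == to_caption(torhu_wears)
print(same_tags, same_caption)
```

The tag sets compare equal and the captions do not, which is the ambiguity in a nutshell. Whether a trained model actually exploits that extra information is a separate question.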
I think this is mainly theoretical at this point. In my experience, current technology doesn’t seem to make much use of the additional information that comes from natural language. For example: prompt DALL-E 2 for “a dinner plate on a stack of pancakes” and you will get ordinary images of pancakes on plates, not the other way around.
Edit: an experiment comparing tags/BOW vs natural language sequences in image generation tasks would be interesting to see.
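As a starting point for that kind of comparison, here is a small Python sketch of what a bag-of-words view throws away. The function names `bag_of_words` and `sequence` are my own, and this only covers the text representation, not any image model. The pancake prompt and its reversal contain exactly the same words, so they are identical as bags and differ only as sequences.

```python
from collections import Counter


def bag_of_words(prompt):
    """Order-free representation: word counts only."""
    return Counter(prompt.lower().split())


def sequence(prompt):
    """Order-preserving representation: the words in the order given."""
    return tuple(prompt.lower().split())


a = "a dinner plate on a stack of pancakes"
b = "a stack of pancakes on a dinner plate"

bow_equal = bag_of_words(a) == bag_of_words(b)
seq_equal = sequence(a) == sequence(b)
print(bow_equal, seq_equal)
```

A model that only ever sees the bag cannot tell these two prompts apart, so any difference in generated images would have to come from word order.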
I think this fails mainly because the situation you describe is unusual, like "a horse riding a person"; most of the time DALL-E 2 is really good at following the prompt.
[0]: https://en.wiktionary.org/wiki/booru