I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
I would assume it has a "mode" token that switches between text/image (and now audio); otherwise you'd have to split the vocabulary, reserving a range of token IDs for each mode. GPT-4o did go from a 100K to a 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost of non-English languages).
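For what it's worth, here's a minimal sketch of what the reserved-range scheme might look like, assuming the image encoder produces indices into a discrete codebook (a VQ-style tokenizer). All the names and sizes below are made up for illustration; nothing here reflects OpenAI's actual layout.

```python
# Hypothetical unified token ID space: disjoint ranges per modality.
# Sizes are assumptions, chosen only to make the example concrete.

TEXT_VOCAB_SIZE = 200_000      # e.g. GPT-4o's reported text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192    # assumed image codebook size
AUDIO_CODEBOOK_SIZE = 4_096    # assumed audio codec vocabulary

# Text IDs come first, then image, then audio.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE

# Special "mode" tokens (the first scheme) would sit at the end.
BOS_IMAGE = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE
BOS_AUDIO = BOS_IMAGE + 1

def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared token ID space."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return IMAGE_OFFSET + code

def token_modality(token_id: int) -> str:
    """Recover which modality a token ID belongs to."""
    if token_id < IMAGE_OFFSET:
        return "text"
    if token_id < AUDIO_OFFSET:
        return "image"
    if token_id < BOS_IMAGE:
        return "audio"
    return "special"

# A mixed sequence: two text tokens, a mode token, two image codes.
sequence = [17, 42, BOS_IMAGE, image_code_to_token(0), image_code_to_token(511)]
print([token_modality(t) for t in sequence])
# ['text', 'text', 'special', 'image', 'image']
```

Either way, the model sees one flat ID space; the difference is just whether modality is signaled by a dedicated switch token or inferred from which range an ID falls in (most published multimodal token schemes do both).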