I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
I would assume it has a "mode" token that switches between text/image (and now audio); otherwise you'd have to split the vocabulary, reserving a range of token IDs for each mode. GPT-4o did go from a 100K to a 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost of non-English languages).
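For what it's worth, here's a minimal sketch of what the reserved-range scheme might look like, assuming the image encoder produces indices into a discrete codebook (a VQ-style tokenizer). All the names and sizes below are made up for illustration; nothing here reflects OpenAI's actual layout.

```python
# Hypothetical unified token ID space: disjoint ranges per modality.
# Sizes are assumptions, chosen only to make the example concrete.

TEXT_VOCAB_SIZE = 200_000      # e.g. GPT-4o's reported text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192    # assumed image codebook size
AUDIO_CODEBOOK_SIZE = 4_096    # assumed audio codec vocabulary

# Text IDs come first, then image, then audio.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE

# Special "mode" tokens (the first scheme) would sit at the end.
BOS_IMAGE = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE
BOS_AUDIO = BOS_IMAGE + 1

def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared token ID space."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return IMAGE_OFFSET + code

def token_modality(token_id: int) -> str:
    """Recover which modality a token ID belongs to."""
    if token_id < IMAGE_OFFSET:
        return "text"
    if token_id < AUDIO_OFFSET:
        return "image"
    if token_id < BOS_IMAGE:
        return "audio"
    return "special"

# A mixed sequence: two text tokens, a mode token, two image codes.
sequence = [17, 42, BOS_IMAGE, image_code_to_token(0), image_code_to_token(511)]
print([token_modality(t) for t in sequence])
# ['text', 'text', 'special', 'image', 'image']
```

Either way, the model sees one flat ID space; the difference is just whether modality is signaled by a dedicated switch token or inferred from which range an ID falls in (most published multimodal token schemes do both).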