
I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?



It’s probable that there is a separate vision encoder which projects the image tiles into the LLM's input embedding space, a la CLIP/LLaVA.
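
For illustration, a minimal sketch of what such a projector could look like, roughly following LLaVA-1.5's two-layer MLP design. All dimensions, shapes, and names here are assumptions for the example, not the actual GPT-4o architecture:

    import torch
    import torch.nn as nn

    VISION_DIM = 1024   # e.g. CLIP ViT-L/14 patch feature size (assumed)
    LLM_DIM = 4096      # e.g. a 7B model's token embedding size (assumed)

    # LLaVA-1.5 maps vision features into the LLM's input embedding space
    # with a small MLP; the image never touches the text tokenizer at all.
    projector = nn.Sequential(
        nn.Linear(VISION_DIM, LLM_DIM),
        nn.GELU(),
        nn.Linear(LLM_DIM, LLM_DIM),
    )

    patch_features = torch.randn(1, 576, VISION_DIM)  # 24x24 patches from one image tile
    image_embeds = projector(patch_features)          # -> (1, 576, LLM_DIM)

    # The projected "soft tokens" are spliced between ordinary embedded
    # text tokens before the transformer runs:
    text_embeds = torch.randn(1, 32, LLM_DIM)         # stand-in for embedded prompt tokens
    inputs_embeds = torch.cat(
        [text_embeds[:, :16], image_embeds, text_embeds[:, 16:]], dim=1
    )
    print(inputs_embeds.shape)  # torch.Size([1, 624, 4096])

Under this scheme the image contributes continuous embeddings rather than discrete token IDs, so there are no "image tokens" in the vocabulary at all.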


I would assume it has a "mode" token where it switches between text/image (and now audio); otherwise you'd have to carve the vocabulary into reserved blocks for each mode. GPT-4o did go from a 100K to a 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost for non-English languages).
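
The vocabulary sizes are easy to check with OpenAI's tiktoken library (cl100k_base is the GPT-4 encoding, o200k_base is GPT-4o's):

    import tiktoken

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        # prints ~100k and ~200k entries respectively, plus a handful
        # of reserved special tokens like <|endoftext|>
        print(name, enc.n_vocab, sorted(enc.special_tokens_set))

Whatever mode switching GPT-4o does for images and audio isn't exposed through the public tokenizer, so it could be special tokens held server-side or continuous embeddings like the CLIP/LLaVA approach described above.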



