We started writing this survey in early 2024, motivated by the idea that the Next Token Prediction paradigm could unify tasks across not just text but also image, audio, and video domains. At the time, works like Unified-IO received far less attention compared to language-focused LLMs.
Fast-forward to early 2025: we’re seeing more unified LMMs, such as Chameleon, Transfusion, and Grok (based on their respective descriptions), that integrate understanding and generation within a single model. This could indeed be the year when LLMs handle multimodal tasks—text, image, audio, and video—seamlessly.
Any suggestions and discussions are welcome :)
Survey: https://arxiv.org/abs/2412.18619
GitHub (related work & code): https://github.com/LMM101/Awesome-Multimodal-Next-Token-Pred...