Next-Token Prediction Is All You Need: Emu3 Beats SDXL, LLaVA-1.6, and OpenSora
1 point by BAAIBeijing 4 months ago
Emu3 is a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 can also generate high-fidelity video by predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling during both training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
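
For readers who want the idea in concrete terms, here is a minimal sketch of next-token prediction over a mixed text and image token sequence. It is not the Emu3 code: the vocabulary sizes, the BOI/EOI marker tokens, and the helper names are all assumptions made up for illustration, and the "model" is just a uniform guess rather than a transformer.

```python
import math

# Hypothetical shared vocabulary: text ids, then visual codebook ids, then markers.
TEXT_VOCAB = 1000                       # assumed size, not taken from Emu3
VISUAL_CODEBOOK = 512                   # assumed size, not taken from Emu3
BOI = TEXT_VOCAB + VISUAL_CODEBOOK      # hypothetical begin-of-image marker
EOI = BOI + 1                           # hypothetical end-of-image marker
VOCAB_SIZE = EOI + 1

def to_sequence(text_ids, visual_codes):
    """Flatten a caption and an image's discrete codes into one token sequence."""
    shifted = [TEXT_VOCAB + c for c in visual_codes]  # move codes into the shared id space
    return list(text_ids) + [BOI] + shifted + [EOI]

def next_token_pairs(seq):
    """Inputs are seq[:-1]; targets are the same sequence shifted left by one."""
    return seq[:-1], seq[1:]

def cross_entropy(logits_per_step, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for logits, t in zip(logits_per_step, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[t]
    return total / len(targets)

seq = to_sequence([5, 17, 42], [3, 3, 500, 0])
inputs, targets = next_token_pairs(seq)

# Stand-in for a model that knows nothing: equal logits for every token.
uniform = [[0.0] * VOCAB_SIZE for _ in targets]
loss = cross_entropy(uniform, targets)
print(seq)
print(loss)
```

The point of the sketch is that once text and image codes share one id space, the training objective is the same shifted-by-one cross-entropy used for language models; the uniform baseline gives a loss of log(VOCAB_SIZE), which a trained model would be expected to beat.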

Project page: https://emu.baai.ac.cn/about



