Show HN: Emu2 – A Gemini-like open-source 37B Multimodal Model
153 points by BAAIBeijing 9 months ago | 27 comments
Hello HN, I'm excited to introduce Emu2, the latest generative multimodal model developed by the Beijing Academy of Artificial Intelligence (BAAI). Emu2 is an open-source initiative that reflects BAAI's commitment to fostering open, secure, and responsible AI research. It's designed to enhance AI's proficiency in handling tasks across various modalities with minimal examples and straightforward instructions.

Emu2 has demonstrated superior performance over other large-scale models like Flamingo-80B in few-shot multimodal understanding tasks. It serves as a versatile base model for developers, providing a flexible platform for crafting specialized multimodal applications.

Key features of Emu2 include:

- A more streamlined modeling framework than its predecessor, Emu.

- A decoder capable of reconstructing images from the encoder's semantic space.

- An expansion to 37 billion parameters, boosting both capabilities and generalization.

BAAI has also released fine-tuned versions, Emu2-Chat for visual understanding and Emu2-Gen for visual generation, which stand as some of the most powerful open-source models available today.

Here are the resources for those interested in exploring or contributing to Emu2:

- Project: https://baaivision.github.io/emu2/

- Model: https://huggingface.co/BAAI/Emu2

- Code: https://github.com/baaivision/Emu/tree/main/Emu2

- Demo: https://huggingface.co/spaces/BAAI/Emu2

- Paper: https://arxiv.org/abs/2312.13286
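For anyone who wants to try the model locally rather than through the demo, here is a minimal sketch of loading Emu2 with Hugging Face transformers. The image-placeholder syntax and the build_input_ids() helper are assumptions about the custom code shipped with the checkpoint, not verified details; please treat the model card and repository README as the authoritative usage.

    # Minimal sketch: load Emu2 via transformers (remote code assumed).
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2")
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,   # Emu2 ships custom modeling code
    ).to("cuda").eval()

    # One image plus an instruction; the placeholder marks where the image
    # embedding is spliced into the token stream (assumed syntax).
    query = "[<IMG_PLH>]Describe the image in detail:"
    image = Image.open("example.jpg").convert("RGB")

    # build_input_ids() is the packing helper from the repo's remote code
    # (assumed name and signature).
    inputs = model.build_input_ids(
        text=[query],
        tokenizer=tokenizer,
        image=[image],
    )

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            image=inputs["image"].to(torch.bfloat16),
            max_new_tokens=64,
        )

    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Note that the 37B checkpoint needs substantial GPU memory in bf16; the repository documents multi-GPU and quantized options if a single card is not enough.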

We're eager to see how the HN community engages with Emu2 and we welcome your feedback to help us improve. Let's collaborate to push the boundaries of multimodal AI!




Thank you for posting this announcement!

I tried to find under which license this work is available, but it does not appear to be stated anywhere. Without the license being clear, it's somewhat hard for people in the US to make use of it. It would be great if the license terms could be clarified.


We have updated the license in the GitHub repository. Please feel free to check it out.


Thanks!


To be more specific, our code follows the Apache 2.0 license and the model weights follow the Llama 1 license, since our model is initialized from Llama 1.


The current model weights are distributed under the GPLv3 license. The authors are in the process of releasing a new series of Emu2 models that will be accompanied by a more flexible licensing arrangement. Stay tuned for updates.


Applying GPL to model weights is very odd indeed.


We have updated the license to Apache-2.0 in the GitHub repository. Please feel free to check it out.


To be more specific, our code follows the Apache 2.0 license and the model weights follow the Llama 1 license, since our model is initialized from Llama 1.


It might be helpful to include this in the README.


We have updated the license to Apache-2.0 in the GitHub repository. Please feel free to check it out.


To be more specific, our code follows the Apache 2.0 license and the model weights follow the Llama 1 license, since our model is initialized from Llama 1.


I’m always in favor of more open-source models, and especially multimodal ones. I feel that the end goal for these models will be multimodal, so it’s important that open source doesn’t just focus on text-based LLMs.

Thank you for your efforts!


This is really cool, thanks for sharing this! I tried a few tests and it did pretty well. But it translated 热闹 as "hot and noisy" with this prompt: Give some example sentences using 热闹 with translation.

>这个城市是一个热闹的城市。 (Zhège chéngshì shì yígè rènao de chéngshì.) - This city is a hot and noisy city.

That seems like a very literal translation; I've never seen that before.


Emu2 is primarily trained on an English corpus, so it may not perform as well as expected on tasks like translation.


The first question I ask myself when I see any new model: What's the license?

Edit: Also couldn't find a license.


To be more specific, our code follows the Apache 2.0 license and the model weights follow the Llama 1 license, since our model is initialized from Llama 1.


We have updated the license in the GitHub repository. Please feel free to check it out.


Can't wait until Emu2 incorporates techniques from the Apple paper https://arxiv.org/pdf/2312.11514.pdf


The examples look very promising, but the queue time of the Hugging Face demo is a bit long; I actually got a timeout error :(


You can try another demo link besides the Hugging Face one. It is technically faster since the frontend is deployed in the same region as the backend, which greatly reduces network communication time.


Wow, the other link is so much faster. Almost the same speed as the much weaker LLaVA demo on Hugging Face.


Wow! It is really amazing that such a capable large-scale multimodal model is open-sourced. Thank you for sharing this.


What is the chat template? I can't find it in the model card or model details.


Thanks for your interest in our work. We will release the chat template in a few days.
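For readers unfamiliar with the term: a chat template is the prompt format (role markers and separators) a chat-tuned checkpoint expects. Once the template ships with the Emu2-Chat tokenizer, it would typically be consumed as below; the messages are illustrative only and the actual role markers Emu2-Chat uses are not yet published, so this is a sketch of the general mechanism rather than the model's real format.

    # Hypothetical usage once a chat template is bundled with the tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")
    messages = [
        {"role": "user", "content": "Describe the attached image."},
    ]
    # Renders the role markers/separators defined by the template, if provided.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)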


Does Emu2 have better Chinese content and translation than, say, ChatGPT or LLaMA?


Emu2 is primarily trained on an English corpus, which may result in suboptimal performance on inputs in other languages.


Is it similar to OpenAI GPT-4-Vision model?

