
> The most impressive thing is how little training data this took.

Yes, this is just the beginning of multi-modality. Just as Alpaca was quickly overshadowed by much better finetunes built on larger and improved datasets, I expect we'll see more capable (and truly open-source) multi-modal models in the near future.

Interestingly, this "billion-scale corpus of images interleaved with text" was just open sourced a few days ago: https://github.com/allenai/mmc4

Looks like low-hanging fruit for any team that wants to take a shot at becoming the "Vicuna" of multi-modal transformers for a while!
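
For anyone who wants to poke at MMC4, here's a minimal sketch in Python of iterating the corpus and pairing images with their matched sentences. The gzipped-JSONL shard layout and the field names (text_list, image_info, matched_text_index, raw_url) are assumptions based on my reading of the repo's README, so double-check against the actual schema:

    # Sketch: iterate MMC4-style shards of interleaved image/text documents.
    # Field names below are assumptions from the MMC4 README, not guaranteed.
    import gzip
    import json
    from pathlib import Path

    def iter_mmc4_docs(shard_dir):
        """Yield (sentences, image_records) pairs from gzipped JSONL shards."""
        for shard in sorted(Path(shard_dir).glob("*.jsonl.gz")):
            with gzip.open(shard, "rt", encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    sentences = doc.get("text_list", [])   # assumed field
                    images = doc.get("image_info", [])     # assumed field
                    yield sentences, images

    # Example: print each image URL with the sentence it was matched to.
    for sentences, images in iter_mmc4_docs("mmc4_shards/"):
        for img in images:
            idx = img.get("matched_text_index")            # assumed field
            caption = sentences[idx] if idx is not None else ""
            print(img.get("raw_url"), "->", caption[:80])  # assumed field
        break  # just the first document for the demo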




I feel like I discovered a deep truth today. People say we need to ground language models, but it turns out they're already being grounded. The visual and spatial reasoning of text-only GPT-4 (before the multimodal version) makes so much more sense now.



