> The most impressive thing is how little training data this took.
Yes, this is just the beginning of multi-modality. Just as Alpaca was quickly overshadowed by much better finetunes built on larger and improved datasets, I expect we'll see more capable (and truly open source) multi-modal models in the near future.
Interestingly, this "billion-scale corpus of images interleaved with text" was just open-sourced a few days ago: https://github.com/allenai/mmc4 (rough sketch of what iterating that kind of corpus looks like below).
Looks like low-hanging fruit for any team that wants to become the "Vicuna" of multi-modal transformers, at least for a while!
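For anyone curious what "interleaved" means in practice: each document is a sequence of text segments with images slotted in at particular positions. Here's a minimal sketch of streaming such a corpus, assuming gzipped JSONL shards; the field names ("text_list", "image_info") and the shard filename are assumptions for illustration, not the exact mmc4 schema, so check the repo's README for the real layout.

```python
import gzip
import json

def iter_documents(shard_path):
    """Yield one interleaved document (text segments + image references) per JSONL line."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            texts = doc.get("text_list", [])    # ordered text segments of the document
            images = doc.get("image_info", [])  # image URLs plus where they sit in the text
            yield texts, images

# Example: how densely images are interleaved per document (hypothetical shard name)
for texts, images in iter_documents("docs_shard_0.jsonl.gz"):
    print(f"{len(texts)} text segments, {len(images)} images")
```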
I feel like I discovered a deep truth today. People say we need to ground language models, but it turns out they're already being grounded. The visual and spatial reasoning of text-only GPT-4 (before the multimodality) makes so much more sense now.