> The most impressive thing is how little training data this took.
Yes, this is just the beginning of multi-modality. Just as Alpaca was quickly overshadowed by much better finetunes built on larger and improved datasets, I expect we'll see more capable (and truly open source) multi-modal models in the near future.
Interestingly, this "billion-scale corpus of images interleaved with text" was just open-sourced a few days ago: https://github.com/allenai/mmc4 (rough sketch of what iterating that kind of corpus looks like below).
Looks like low-hanging fruit for any team that wants to become the "Vicuna" of multi-modal transformers, at least for a while!
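For anyone curious what "interleaved" means in practice: each document is a sequence of text segments with images slotted in at particular positions. Here's a minimal sketch of streaming such a corpus, assuming gzipped JSONL shards; the field names ("text_list", "image_info") and the shard filename are assumptions for illustration, not the exact mmc4 schema, so check the repo's README for the real layout.

```python
import gzip
import json

def iter_documents(shard_path):
    """Yield one interleaved document (text segments + image references) per JSONL line."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            texts = doc.get("text_list", [])    # ordered text segments of the document
            images = doc.get("image_info", [])  # image URLs plus where they sit in the text
            yield texts, images

# Example: how densely images are interleaved per document (hypothetical shard name)
for texts, images in iter_documents("docs_shard_0.jsonl.gz"):
    print(f"{len(texts)} text segments, {len(images)} images")
```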
I feel like I discovered a deep truth today. People say we need to ground language models, but it turns out they're already being grounded. The visual and spatial reasoning of text-only GPT-4 (before the multimodality) makes so much more sense now.