This is the first time I'm seeing the output of a vision model and thinking "Wow, this can really see." Incredibly robust. I really wonder how they pulled this one off. It can't just be an image-to-text objective, surely.
Just as food for thought, here's how I would try to build this:
1. Train an OCR task that accumulates the text in its hidden state.
2. Train an image recognition task that accumulates the image description in its hidden state.
For both of these, the state decoder is something like the usual autoregressive transformer, converting the state into text that can be easily scored (rough sketch below).
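A minimal sketch of what I mean, in PyTorch. Every module name and size here is made up for illustration; this is just the shape of the idea, not a claim about how any real model does it:

```python
import torch
import torch.nn as nn

class SharedStateModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # Hypothetical ViT-style patch embedding: flattened 16x16 RGB patches -> tokens.
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, target_tokens):
        # The shared hidden state that both tasks "accumulate" into.
        state = self.encoder(self.patch_embed(patches))
        tgt = self.tok_embed(target_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.lm_head(self.decoder(tgt, state, tgt_mask=causal))

def training_step(model, patches, ocr_tokens, caption_tokens):
    # Score the same hidden state against both the OCR target and the description target.
    loss_fn = nn.CrossEntropyLoss()
    ocr_loss = loss_fn(model(patches, ocr_tokens[:, :-1]).transpose(1, 2),
                       ocr_tokens[:, 1:])
    cap_loss = loss_fn(model(patches, caption_tokens[:, :-1]).transpose(1, 2),
                       caption_tokens[:, 1:])
    return ocr_loss + cap_loss
```

The point is that both losses backpropagate into the same encoder state, so the accumulated OCR text and the image description are forced to live in one representation.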
You now have something like the CLIP embeddings that guide DALL-E, Imagen, etc. You temporarily freeze that while you train language tasks on the same hidden states, and then hopefully the model will unify the embeddings for OCR, recognition, and language input.
If that happens, then the attention layers in the decoder allow the model to learn a sort of internal monologue, where the image is replaced by text describing it plus the OCRed text.
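Continuing the hypothetical SharedStateModel above, the freeze-then-train-language step could look roughly like this (text_encoder and the language batches are assumptions on my part):

```python
import torch
import torch.nn as nn

def freeze_phase(model, text_encoder, language_batches, steps=1000):
    """model: the SharedStateModel sketched above.
    text_encoder: any hypothetical module mapping (batch, seq, d_model)
    token embeddings into the shared state space."""
    # Freeze the image-grounded encoder so the existing state space stays put.
    for p in list(model.encoder.parameters()) + list(model.patch_embed.parameters()):
        p.requires_grad = False

    loss_fn = nn.CrossEntropyLoss()
    trainable = (list(text_encoder.parameters()) + list(model.decoder.parameters())
                 + list(model.tok_embed.parameters()) + list(model.lm_head.parameters()))
    opt = torch.optim.AdamW(trainable, lr=1e-4)

    for _, (prompt_tokens, target_tokens) in zip(range(steps), language_batches):
        # Language input lands in the same hidden-state space the images were trained into.
        state = text_encoder(model.tok_embed(prompt_tokens))
        tgt = model.tok_embed(target_tokens[:, :-1])
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        logits = model.lm_head(model.decoder(tgt, state, tgt_mask=causal))
        loss = loss_fn(logits.transpose(1, 2), target_tokens[:, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
```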
I'm particularly interested in how GPT-4 manages multi-modal processing. Do the images share the same domain as the text inputs, or is there some location in the model inputs that is ~for images only~? The Technical Report states that "the model generates text outputs given inputs consisting of arbitrarily interlaced text and image"[1], but that doesn't really clear up how the images are being treated here.
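For what it's worth, OpenAI hasn't said how GPT-4 does it, but in open models like Flamingo or LLaVA the answer is "same domain": image features get projected into the text embedding space and interleaved with the text tokens, so "arbitrarily interlaced" just means any ordering of segments. A sketch of that reading (all names and sizes are illustrative, not GPT-4's actual design):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
tok_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> d_model
img_proj = nn.Linear(768, d_model)              # e.g. ViT patch features -> d_model

def interleave(segments):
    """segments: list of ('text', LongTensor[T]) or ('image', FloatTensor[P, 768])."""
    parts = []
    for kind, x in segments:
        parts.append(tok_embed(x) if kind == "text" else img_proj(x))
    # One flat sequence: images and text live in the same embedding domain.
    return torch.cat(parts, dim=0)

seq = interleave([
    ("text", torch.randint(0, vocab_size, (7,))),
    ("image", torch.randn(196, 768)),            # 14x14 patches of one image
    ("text", torch.randint(0, vocab_size, (5,))),
])
print(seq.shape)  # (7 + 196 + 5, d_model)
```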
I would hope that there are plans to integrate this sort of tech into things like self-driving cars. I feel like this could be one of the "missing ingredients", so to speak.
I disagree with the other comments. This is a single frame of video, from which NO context can be derived. Anyone, not just AI, can guess at "emotions" or "intents" or "attributions". It's all just weighted guesswork.