LLaVA's language model appears to be weaker than GPT-4's.
And we've already seen the featured GPT-4 demo where it explains a visual gag with a modified phone charging cable. So GPT-4 might get this.
On the other hand, this test is almost certainly in GPT-4's training set. I've seen GPT-4 correctly answer questions that appear on only a half-dozen pages on the entire web, and since the Obama scale test is from 2012, there's a decent chance GPT-4 just remembers the answer.
I think you've highlighted the true issue. How do you know the model doesn't just have a glimpse of just about everything and look it up when you ask about it?
It came close in both tests, but didn't recognize that Obama was tipping the scale.
To be honest I didn't either, and only noticed it after reading the 2012 Karpathy challenge:
> (...)
> You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I’m using: It is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.
> You know how physics works: Obama is leaning in on the scale, which applies a force on it. Scale measures force that is applied on it, that’s how it works => it will over-estimate the weight of the person standing on it.
> The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama’s foot.
> You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people’s heads. You also reason about what pieces of information are available to people.
> There are people in the back who find the person’s imminent confusion funny. In other words you are reasoning about state of mind of people, and their view of the state of mind of another person. That’s getting frighteningly meta.
> Finally, the fact that the perpetrator here is the president makes it maybe even a little more funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.
ChatGPT got only the points above those, i.e. the ones elided by the "(...)".
Wow. So far this seems better than mini-GPT-4. The most impressive thing is how little training data this took. And the fact that you only need CLIP and a linear projection is wild. At least with mini-GPT-4, you could say that BLIP-2 already translates RGB into text token space, so maybe a simple linear projection should work. But that you can project the image encoding directly... wild.
EDIT - Just found this paper that demonstrated this a few months ago (that somehow language and vision models learn representations similar enough that a linear projection is enough): https://arxiv.org/abs/2209.15162
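For anyone wondering what "just a linear projection" looks like in practice, here's a minimal sketch of the idea, assuming CLIP ViT-L/14's 1024-d patch features and LLaMA-7B's 4096-d token embeddings. Class and variable names are illustrative, not LLaVA's actual code:

    import torch
    import torch.nn as nn

    # Minimal sketch of the LLaVA-style connector: a single learned linear map
    # from frozen CLIP patch features into the LLM's token-embedding space.
    # Dimensions assume CLIP ViT-L/14 (1024-d) and LLaMA-7B (4096-d); the real
    # code may differ.
    class VisionProjector(nn.Module):
        def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Linear(clip_dim, llm_dim)

        def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
            # patch_features: (batch, num_patches, clip_dim) from a frozen CLIP encoder
            # returns:        (batch, num_patches, llm_dim), treated as "visual tokens"
            return self.proj(patch_features)

    # The projected patches are simply prepended to the text token embeddings
    # before being fed to the language model.
    visual_tokens = VisionProjector()(torch.randn(1, 256, 1024))
    print(visual_tokens.shape)  # torch.Size([1, 256, 4096])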
> The most impressive thing is how little training data this took.
Yes, this is just the beginning of multi-modality. Like how Alpaca was quickly overshadowed by much better finetunes with larger and improved datasets, I expect we'll see more capable (and truly open-source) multi-modal models in the near future.
Interestingly, this "billion-scale corpus of images interleaved with text" was just open sourced a few days ago: https://github.com/allenai/mmc4
Looks like low hanging fruit for any team who wants to attempt to become the "Vicuna" of multi-modal transformers for a while!
I feel like I discovered a deep truth today. People say we need to ground language models, but it turns out they're already being grounded. Text-only (before the multimodality) GPT-4's visual and spatial reasoning makes so much more sense now.
I chose the example image with a man in a yellow shirt doing something next to a yellow car, with a yellow taxi driving in the background. I replaced the example question with my own:
Q: How many cars are in the picture?
A: There is a single car visible in the image.
Q: Is that true?
A: Yes, there is only one car visible in the image.
Q: How many yellow objects are in the image?
A: There are two yellow objects in the image.
Q: What are they?
A: One of the yellow objects is a man holding onto a clothesline while standing on a car. The other yellow object is a suitcase on the same car.
P: How many monkeys with yellow shirt are in this photo?
A: There are two monkeys wearing yellow shirts in the photo.
P: But there is no monkeys.
A: I'm sorry, but the image description contains an error. There are no monkeys in the photo. Instead, there are two men wearing yellow shirts.
Feels like these large models are the pinnacle of ML. These will be built for various sensors/inputs and verticals. Then what? Can we go another level up with the tech we have?
I think of these as the missing glue that other ML/agent/expert systems lacked, but they also lack the capabilities of those systems. The true value will come when all the AI techniques are built as an ensemble system in a feedback loop, with these generative models in the middle.
I tried a few examples using https://commons.wikimedia.org/wiki/Special:Random/Image and it was pretty good at captioning photos. It definitely bullshitted some, making up some elements that weren't there. Might be a decent way to do default alt text for photo uploads, although that would amplify the citogenesis problem.
If I understand the image processing part [1] of the code correctly, each image is scaled down (to at most 448x224) and uses <= 512 tokens; each 14x14-pixel patch becomes one input token.
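If that reading is right, the arithmetic checks out. A quick back-of-the-envelope check, assuming 14x14 patches at the 448x224 maximum:

    # 448x224 image cut into 14x14-pixel patches: 32 * 16 = 512 visual tokens.
    max_w, max_h, patch = 448, 224, 14
    print((max_w // patch) * (max_h // patch))  # 512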
Specifics work too if they're targeted. For instance, asking "what's the color of the boy's shirt?" or "what of his shorts?" returns accurate results even if those details are missed by a general "describe this image" request.
This is the same approach as Mini-GPT4 of training a linear layer with a frozen image encoder and a frozen text decoder. The difference is that they use CLIP instead of BLIP, and LLaMA instead of Vicuna (which was itself trained on LLaMA). Interesting that they both came out at the same time.
> We develop a large multimodal model (LMM), by connecting the open-set visual encoder of CLIP [36] with the language decoder LLaMA, and fine-tuning them end-to-end on our generated instructional vision-language data
>> and setting a new state-of-the-art accuracy on Science QA.
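For what it's worth, the shared recipe described above boils down to something like the sketch below: freeze the pretrained vision encoder and language decoder, and train only the bridging projection. All module names are stand-ins, not either repo's actual code:

    import torch
    import torch.nn as nn

    # Schematic of the shared recipe: freeze the pretrained pieces and train
    # only the projection layer between them. Names below are stand-ins, not
    # the actual modules from the LLaVA or Mini-GPT4 repos.
    def freeze(module: nn.Module) -> None:
        for p in module.parameters():
            p.requires_grad = False

    clip_encoder = nn.Identity()        # stand-in for the frozen vision encoder (CLIP / BLIP-2)
    language_model = nn.Identity()      # stand-in for the frozen decoder (LLaMA / Vicuna)
    projector = nn.Linear(1024, 4096)   # the only trainable piece during alignment

    freeze(clip_encoder)
    freeze(language_model)

    # Only the projector's weights go to the optimizer; each step encodes the
    # image without gradients, projects the patch features into the decoder's
    # embedding space, prepends them to the text embeddings, and backpropagates
    # the usual next-token loss through the projector alone.
    optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-3)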
Given the size of these things, would there be any more actual information in it than, say, Wikipedia and maybe some other sources? In other words, is it just able to summarize and answer questions about its training data?
Stable Diffusion is 2GB and can produce just about any image imaginable with the right prompt. LLMs are similar. Size is not as important as the number of useful paths through the weights. With billions of parameters, the number of paths in a well-trained model is almost intractably huge. The training methods are more important than the size of the model. Don't be misled by their size.
Don't give it yearbook photos, it seems to hallucinate things in them. At least it identified Paul Rudd pretty easily!
> In the image, a man is smiling at the camera, looking somewhat similar to the character played by Paul Rudd in the movie Clueless. The man has curly hair and is wearing a plaid shirt. He is surrounded by a few other people, who are standing around him in various poses. The photo seems to capture a social gathering or event where everyone is having a good time and enjoying each other's company.
LLaVA: In the image, there is a person lying down on a sidewalk, possibly taking a nap or simply resting. This person is wearing a black jacket and has a beard. The sidewalk appears to be part of a street with several cars parked along it, as well as benches located nearby.
People seeing this image might respond in various ways. Some may feel concerned about the person lying on the sidewalk, wondering if they are homeless, in need of assistance, or simply taking a break. Others might not pay much attention to the person and continue with their daily activities. The presence of cars and benches in the scene suggests that it is an urban setting, where people are accustomed to seeing different things happening, so the sight of someone resting on the sidewalk might not be particularly surprising or alarming to most passersby.
Prompt: Describe this photograph to me. Where might the photograph have been taken? Why might a person have taken this photograph?
LLaVA: The image features a building with a political poster plastered on its side wall. In front of the building, a bicycle is parked, capturing the viewer's attention. The bike is leaning against the wall, possibly providing convenient transportation for someone who stopped to take the photograph.
The photograph may have been taken in an urban area where street art and political posters are common, or perhaps near a public building where such posters are displayed. The person might have captured this moment to highlight the juxtaposition between the bicycle and the political poster, or simply as a candid snapshot of their surroundings.
Prompt: Can you read any of the text in the image?
LLaVA: Unfortunately, as an AI, I cannot read text in images. However, the image features a political poster, and it is likely that the content of the poster is related to an upcoming election or some other political event or issue.
Prompt: Describe the scene in this photograph to me. Where do you think it was taken? What might have been the occasion?
LLaVA: The image shows a large crowd of people gathered in a city, possibly in an Asian metropolis. The scene is bustling with activity, as people are seen standing, walking, and interacting with each other in a busy street. Many individuals in the crowd are wearing face paint, suggesting that they might be supporting a team or attending a special event.
There are several handbags and backpacks visible among the crowd, indicating that these people might be tourists or commuters. The presence of traffic lights in the area implies that the gathering is taking place in a well-developed urban setting. Overall, the atmosphere in the image is lively and energetic, with people coming together to celebrate or participate in an event.
[Note: The photograph was taken in Shibuya, Tokyo, on Halloween 2015.]
LLaVA: "In the image, a bearded man wearing a jacket is laying down on a sidewalk, seemingly taking a nap. As he dreams, he finds himself transported to a tropical island where he is surrounded by lush greenery and the sound of waves crashing on the shore. He can feel the warm sunshine on his face and the soft breeze rustling through the palm trees. In this dream, the man is completely relaxed and carefree, enjoying the serene atmosphere and the beauty of the island."
It seems to be able to count, though? What did you try? You have to specifically ask how many X there are in the image; X probably can't be generic. I've tried "how many dogs?" and "how many animals?"
For the spider I asked “What kind of spider is this? Is it dangerous?”
It replied “The image features a large spider on a small bricks on a wall, with a close-up view of the spider in the center of the frame. It is a creepy crawly spider, as some might describe it. The spider appears to be walking on or sitting on the bricks, and seems to be doing some exploring outside. While the presence of a large spider may not be everyone's favorite, it is a part of the natural world and contributes to the diversity of species in the image taken with a Nikon Coolpix, capturing the small details of the spider and its surroundings.”
Critique: I believe it’s a wolf spider, though I would also believe trapdoor spider and maybe black house spider. Initially I thought it was a funnel-web spider, which is why I took a picture. The general Australian wisdom would be “looks similar enough to a funnel-web, treat it as very dangerous, just in case”. Taken with a zoomed in iPhone, not a Nikon Coolpix. It is obviously not exploring outside.
For the mushroom I asked “What is this?”
It replied “The image shows a close-up view of a rock on the ground, surrounded by grass and other plants. It appears to be on a rocky path with some leafy plants nearby. This scene might be in a natural outdoor setting or even a garden.”
Critique: it’s a big solitary mushroom, at night, lit by a powerful torch. The surroundings are correctly identified (gravel path and lightly hardened natural outdoor setting).
For the screenshot I asked “What’s so funny about this screenshot?”
It replied “In the image, there are two posts on a social media platform, both containing text and appearing as screenshots of text messages. These posts seem to be part of a conversation between two people. One post is in a foreign language, and the other is in English. The latter is a response to the former, possibly discussing the content of the initial message.”
Critique: not a conversation; the latter is not a response to the former, they are merely both discussing the same topic. Neither post is in a foreign language. No attempt to address why it might be funny. (I would accept any attempt even if it's wrong. Something like "one post says something won't happen, while right next to it, there is another post reporting it has happened" would get partial credit. Full credit would have to acknowledge a few of the incongruities: people worried about "runaway AGI" are concerned about catastrophic risk, but it is intentionally misunderstood to mean eloping with the AGI as a romantic partner, rendering the reassurance worthless; the claim that it has already happened likewise appears to be a misunderstanding, since it is reported in the manner of a pet running away, a somewhat different scale of tragedy, and incongruous because we don't think of AGI as a pet.)
Overall not super impressed with the quality so far. Describing images is impressive for sure, but subtleties, humour, and being able to draw on common wisdom are some of the main appeals of “multi-modal GPT“ and this is lacking that spirit.
[1] http://karpathy.github.io/2012/10/22/state-of-computer-visio... [2] https://imgur.com/LbgJvMQ