LLaVA: Large Language and Vision Assistant (llava-vl.github.io)
206 points by MacsHeadroom on April 18, 2023 | 54 comments




Nice example!

LLaVA's language model appears to be weaker than GPT-4's.

And we've already seen the featured GPT-4 demo where it explains a visual gag with a modified phone charging cable. So GPT-4 might get this.

On the other hand, this test is almost certainly in GPT-4's training set. And I've seen GPT-4 correctly answer questions that appear on only a half dozen pages on the entire web. And since the Obama scale test is from 2012, there's a decent chance GPT-4 might just remember the answer.



It didn't. The Twitter post is popular and thus surely inadvertently part of the training set.


“Inadvertently”


Is that still a relevant test if the blog post is in the training data? How can I know whether it is or isn't included anyway?


I think you've highlighted the true issue. How do you know the model doesn't just have a glimpse of just about everything and look it up when you ask about it?


Some variations. https://imgur.com/a/nJGcwMO

I tried giving it more hints, but that didn't seem to do anything useful.


It came close in both tests, but didn't recognize that Obama was tipping the scale

To be honest I didn't either, and only noticed it after reading the 2012 Karpathy challenge

> (...)

> You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I’m using: It is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.

> You know how physics works: Obama is leaning in on the scale, which applies a force on it. Scale measures force that is applied on it, that’s how it works => it will over-estimate the weight of the person standing on it.

> The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama’s foot.

> You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people’s heads. You also reason about what pieces of information are available to people.

> There are people in the back who find the person’s imminent confusion funny. In other words you are reasoning about state of mind of people, and their view of the state of mind of another person. That’s getting frighteningly meta.

> Finally, the fact that the perpetrator here is the president makes it maybe even a little more funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.

ChatGPT got only the points above those (the ones elided at the "(...)").


Can someone with access to GPT-4 multimodal test whether it does?



Personally I'd like to see the response it gave, not just a tweet saying it passed the test without showing the results of the test.


It’s not the Obama scale picture though?

EDIT: I see now the link is different than the thumbnail in the post. Super impressive that GPT4 has solved this.


Read the OP's question and Karpathy's reply. Ignore the example in the image.


Wow. So far this seems better than MiniGPT-4. The most impressive thing is how little training data this took. And the fact that you only need CLIP and a linear projection is wild. At least with MiniGPT-4, you could say that BLIP-2 already translates RGB into text token space, so maybe a simple linear projection should work. But that you can project the image encoding directly... wild.

EDIT - Just found this paper that demonstrated this a few months ago (that somehow language and vision models learn representations similar enough that a linear projection is enough): https://arxiv.org/abs/2209.15162


> The most impressive thing is how little training data this took.

Yes, this is just the beginning of multi-modality. Like how Alpaca was quickly overshadowed with much better finetunes with larger and improved datasets, I expect we'll see more capable (and truly open source) multi-modal models in the near future.

Interestingly, this "billion-scale corpus of images interleaved with text" was just open sourced a few days ago: https://github.com/allenai/mmc4

Looks like low-hanging fruit for any team that wants to become the "Vicuna" of multi-modal transformers for a while!


I feel like I discovered a deep truth today. People say we need to ground language models, but it turns out they're already being grounded. Text-only (before the multimodality) GPT-4's visual and spatial reasoning makes so much more sense now.


I chose the example image with a man in a yellow shirt doing something next to a yellow car, with a yellow taxi driving in the background. I replaced the example question with my own:

Q: How many cars are in the picture?

A: There is a single car visible in the image.

Q: Is that true?

A: Yes, there is only one car visible in the image.

Q: How many yellow objects are in the image?

A: There are two yellow objects in the image.

Q: What are they?

A: One of the yellow objects is a man holding onto a clothesline while standing on a car. The other yellow object is a suitcase on the same car.

Every single answer was wrong.


Even better:

  P: How many monkeys with yellow shirt are in this photo?  
  A: There are two monkeys wearing yellow shirts in the photo.
  P: But there is no monkeys.
  A: I'm sorry, but the image description contains an error. There are no monkeys in the photo. Instead, there are two men wearing yellow shirts.


I did something similar with the image containing the dock. It said there were three planks in that image...


Feels like these large models are the pinnacle of ML. These will be built for various sensors/inputs and verticals. Then what? Can we go another level up with the tech we have?


I think of these as the missing glue that other ML/agent/expert systems lacked, but they also lack the ability of those systems. The true value will be when all the AI techniques are built as an ensemble system in a feedback loop with these generative models in the middle.


Like JARVIS, HuggingGPT, AutoGPT, AgentGPT, BabyAGI, etc.?

Those are quite cool too.


Things are only just getting started, and this is a relatively small model (13B).


A much better name than Mini-GPT4, in my opinion. Hopefully new cool projects that come out follow their example :)


I wish projects like these would mention the minimum GPU VRAM one needs, for inference or fine-tuning.


This is simply a finetune of LLaMA-13B, so it has the same ~11GB RAM or VRAM requirement as 13B for inference.

With 4bit LoRAs you can finetune 13B with 12GB of VRAM. On a 3090 it can take as little as a couple of hours.
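Roughly, that setup looks like the following with Hugging Face transformers + peft + bitsandbytes (a sketch only, assuming a locally available 13B checkpoint; exact flags vary by library version, and this is not a script from the LLaVA repo):

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model

  model_id = "path/to/llama-13b"  # hypothetical local checkpoint

  # Load the frozen base weights quantized to 4 bits so they fit in ~12GB of VRAM.
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.float16,
      ),
      device_map="auto",
  )

  # Attach small trainable LoRA adapters to the attention projections;
  # only these (well under 1% of the 13B parameters) need optimizer state.
  model = get_peft_model(model, LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
      bias="none", task_type="CAUSAL_LM",
  ))
  model.print_trainable_parameters()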


> LLaVa connects pre-trained CLIP ViT-L/14 visual encoder and large language model LLaMA, using a simple projection matrix.

The fact that it is "this simple" to connect foundation models is astounding.
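For intuition, here's a minimal PyTorch sketch of what that projection amounts to (the dimensions are assumptions: CLIP ViT-L/14 gives 1024-d patch features, LLaMA-13B uses 5120-d token embeddings; the class name is mine, not from the repo):

  import torch
  import torch.nn as nn

  CLIP_DIM, LLAMA_DIM = 1024, 5120  # assumed encoder/decoder widths

  class VisionToTextProjection(nn.Module):
      """One trainable linear layer mapping image patch features
      into the language model's token-embedding space."""
      def __init__(self, clip_dim=CLIP_DIM, llama_dim=LLAMA_DIM):
          super().__init__()
          self.proj = nn.Linear(clip_dim, llama_dim)

      def forward(self, patch_features):
          # patch_features: (batch, num_patches, clip_dim)
          # returns "visual tokens": (batch, num_patches, llama_dim)
          return self.proj(patch_features)

  # The projected patches are concatenated with the text-token embeddings
  # and fed into the LLaMA decoder like any other tokens.
  visual_tokens = VisionToTextProjection()(torch.randn(1, 256, CLIP_DIM))
  print(visual_tokens.shape)  # torch.Size([1, 256, 5120])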


I tried a few examples using https://commons.wikimedia.org/wiki/Special:Random/Image and it was pretty good at captioning photos. It definitely bullshitted some, making up some elements that weren't there. Might be a decent way to do default alt text for photo uploads, although that would amplify the citogenesis problem.


If I understand the image processing part [1] of the code correctly, each image is scaled down (max 448x224) and only uses <= 512 tokens. 14x14 pixels is one input token.

[1] https://github.com/haotian-liu/LLaVA/blob/ee175feb646f4673fe...
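If that reading is right, the back-of-the-envelope arithmetic works out like this (illustrative, not code from the repo):

  # ViT-style patching: each 14x14-pixel patch becomes one input token.
  PATCH = 14
  MAX_W, MAX_H = 448, 224  # images are scaled down to fit in this box

  def num_image_tokens(width, height):
      # Fit the image inside MAX_W x MAX_H, preserving aspect ratio.
      scale = min(MAX_W / width, MAX_H / height, 1.0)
      w, h = int(width * scale), int(height * scale)
      return (w // PATCH) * (h // PATCH)

  print(num_image_tokens(448, 224))    # 32 * 16 = 512 tokens (the cap)
  print(num_image_tokens(1920, 1080))  # fewer, after downscaling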


Delta weights are on Hugging Face: https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

More technical details on the GitHub: https://github.com/haotian-liu/LLaVA
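The repo has its own conversion script for those deltas; conceptually it's just elementwise addition onto the base LLaMA-13B weights you obtain separately (a sketch of the idea, with hypothetical file paths, not the repo's script):

  import torch

  def apply_delta(base_state, delta_state):
      """Reconstruct the finetuned weights by adding the released deltas
      onto the original LLaMA weights."""
      merged = {}
      for name, delta in delta_state.items():
          if name in base_state and base_state[name].shape == delta.shape:
              merged[name] = base_state[name] + delta
          else:
              # Parameters that exist only in the finetune (e.g. the new
              # vision projection) are taken from the delta as-is.
              merged[name] = delta
      return merged

  # base = torch.load("llama-13b/pytorch_model.bin")        # hypothetical paths
  # delta = torch.load("llava-13b-delta/pytorch_model.bin")
  # torch.save(apply_delta(base, delta), "llava-13b/pytorch_model.bin")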


Seems better with abstract stuff than specifics, at least in the very limited tests I've done:

https://imgur.com/a/tJ8VRa5

It seems to think a seal is a penguin, although it is front-on and arguably could be a chick based on the colour, but...


Specifics work too if they're targeted. For instance, asking "what's the color of the boy's shirt?" or "what of his shorts?" returns accurate results even if those details are missed in a general "describe this image" request.

https://imgur.com/a/4KLRDUv


This is the same approach as MiniGPT-4: training a linear layer using a frozen image encoder and text decoder. The difference is that they use CLIP instead of BLIP, and LLaMA instead of Vicuna (which was trained on LLaMA). Interesting that they both came out at the same time.
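In code, the "frozen" part is just turning off gradients everywhere except the new layer, which is why the alignment stage is so cheap (a sketch with illustrative names, not either repo's training code):

  import torch
  import torch.nn as nn

  def freeze(module: nn.Module):
      for p in module.parameters():
          p.requires_grad = False

  def build_optimizer(vision_encoder, language_model, projection, lr=1e-3):
      freeze(vision_encoder)   # image features come from a fixed encoder
      freeze(language_model)   # the text decoder is not updated in this stage
      # Only the projection layer's weights receive gradients and optimizer state.
      return torch.optim.AdamW(projection.parameters(), lr=lr)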


It says they used Vicuna


The paper says LLaMA:

> We develop a large multimodal model (LMM), by connecting the open-set visual encoder of CLIP [36] with the language decoder LLaMA, and fine-tuning them end-to-end on our generated instructional vision-language data


>> and setting a new state-of-the-art accuracy on Science QA.

Given the size of these things, would there be any more actual information in it than say Wikipedia and maybe some other sources? In other words, is it just able to summarize and answer questions about its training data?


Stable Diffusion is 2GB and can produce just about any image imaginable with the right prompt. LLMs are similar. Size is not as important as the number of useful paths through the weights. With billions of parameters, the number of paths in a well-trained model is almost intractably huge. The training methods are more important than the size of the model. Don't be misled by their size.


Don't give it yearbook photos, it seems to hallucinate things in them. At least it identified Paul Rudd pretty easily!

> In the image, a man is smiling at the camera, looking somewhat similar to the character played by Paul Rudd in the movie Clueless. The man has curly hair and is wearing a plaid shirt. He is surrounded by a few other people, who are standing around him in various poses. The photo seems to capture a social gathering or event where everyone is having a good time and enjoying each other's company.

https://imgur.com/a/kfTgkQI


Yes, it is a gotcha, but oh boy oh boy.

https://imgur.com/a/His3UEj


This is my first time trying a multimodal model. I fed it some photographs I have taken and asked for responses. Here are the results:

Prompt: Please describe this image to me and tell me how people seeing the image might respond to it.

https://www.gally.net/sleepers/2004-5winter/07.html

LLaVA: In the image, there is a person lying down on a sidewalk, possibly taking a nap or simply resting. This person is wearing a black jacket and has a beard. The sidewalk appears to be part of a street with several cars parked along it, as well as benches located nearby.

People seeing this image might respond in various ways. Some may feel concerned about the person lying on the sidewalk, wondering if they are homeless, in need of assistance, or simply taking a break. Others might not pay much attention to the person and continue with their daily activities. The presence of cars and benches in the scene suggests that it is an urban setting, where people are accustomed to seeing different things happening, so the sight of someone resting on the sidewalk might not be particularly surprising or alarming to most passersby.

Prompt: Describe this photograph to me. Where might the photograph have been taken? Why might a person have taken this photograph?

https://www.gally.net/posters/17.html

LLaVA: The image features a building with a political poster plastered on its side wall. In front of the building, a bicycle is parked, capturing the viewer's attention. The bike is leaning against the wall, possibly providing convenient transportation for someone who stopped to take the photograph.

The photograph may have been taken in an urban area where street art and political posters are common, or perhaps near a public building where such posters are displayed. The person might have captured this moment to highlight the juxtaposition between the bicycle and the political poster, or simply as a candid snapshot of their surroundings.

Prompt: Can you read any of the text in the image?

LLaVA: Unfortunately, as an AI, I cannot read text in images. However, the image features a political poster, and it is likely that the content of the poster is related to an upcoming election or some other political event or issue.

Prompt: Describe the scene in this photograph to me. Where do you think it was taken? What might have been the occasion?

https://www.gally.net/halloweeninshibuya/5.jpg

LLaVA: The image shows a large crowd of people gathered in a city, possibly in an Asian metropolis. The scene is bustling with activity, as people are seen standing, walking, and interacting with each other in a busy street. Many individuals in the crowd are wearing face paint, suggesting that they might be supporting a team or attending a special event.

There are several handbags and backpacks visible among the crowd, indicating that these people might be tourists or commuters. The presence of traffic lights in the area implies that the gathering is taking place in a well-developed urban setting. Overall, the atmosphere in the image is lively and energetic, with people coming together to celebrate or participate in an event.

[Note: The photograph was taken in Shibuya, Tokyo, on Halloween 2015.]


Those are some incredible results!

You can do more than ask it to describe the picture.

For example: "Write a short fictional story about what the sleeping man in this picture is dreaming about."

https://www.gally.net/sleepers/2004-5winter/07.html

LLaVA: "In the image, a bearded man wearing a jacket is laying down on a sidewalk, seemingly taking a nap. As he dreams, he finds himself transported to a tropical island where he is surrounded by lush greenery and the sound of waves crashing on the shore. He can feel the warm sunshine on his face and the soft breeze rustling through the palm trees. In this dream, the man is completely relaxed and carefree, enjoying the serene atmosphere and the beauty of the island."


Pity these can't count objects in images (a limitation of CLIP?). It would be useful.


It seems to be able to count, though? What did you try? You have to specifically ask how many X there are in the image. X probably can't be generic, from what I tried; I've tried "how many dogs?" and "how many animals?"

Maybe it wouldn't work with high numbers


It counted 14 people while there were more than 100.


Is this related to MiniGPT4 as a project? (I know both are multimodal and rely on Vicuna)

https://news.ycombinator.com/item?id=35598281


It's not related


I see. Interesting timing!


Yields pretty funny results with my usual test https://aus.social/@s_mcleod/110223104698463219


I'll be honest, I have no idea what is going on in that picture. Appears to be a man in some kind of inflated rubber balloon or something.


The man is actor Jim Carrey; it's a frame from a scene in the Ace Ventura: Pet Detective movies. A rhino is giving birth to him.


Bingo


(All using LLaVa 13B v0, defaults of 0.7 temp, 512 output tokens)

I put in these three images (warning! First one is a big spider!) https://imgur.com/a/PgQT202

For the spider I asked “What kind of spider is this? Is it dangerous?”

It replied “The image features a large spider on a small bricks on a wall, with a close-up view of the spider in the center of the frame. It is a creepy crawly spider, as some might describe it. The spider appears to be walking on or sitting on the bricks, and seems to be doing some exploring outside. While the presence of a large spider may not be everyone's favorite, it is a part of the natural world and contributes to the diversity of species in the image taken with a Nikon Coolpix, capturing the small details of the spider and its surroundings.”

Critique: I believe it’s a wolf spider, though I would also believe trapdoor spider and maybe black house spider. Initially I thought it was a funnel-web spider, which is why I took a picture. The general Australian wisdom would be “looks similar enough to a funnel-web, treat it as very dangerous, just in case”. Taken with a zoomed in iPhone, not a Nikon Coolpix. It is obviously not exploring outside.

For the mushroom I asked “What is this?”

It replied “The image shows a close-up view of a rock on the ground, surrounded by grass and other plants. It appears to be on a rocky path with some leafy plants nearby. This scene might be in a natural outdoor setting or even a garden.”

Critique: it’s a big solitary mushroom, at night, lit by a powerful torch. The surroundings are correctly identified (gravel path and lightly hardened natural outdoor setting).

For the screenshot I asked “What’s so funny about this screenshot?”

It replied “In the image, there are two posts on a social media platform, both containing text and appearing as screenshots of text messages. These posts seem to be part of a conversation between two people. One post is in a foreign language, and the other is in English. The latter is a response to the former, possibly discussing the content of the initial message.”

Critique: not a conversation, latter is not a response to the former, they are merely both discussing the same topic. Neither post is in a foreign language. No attempt to address why it might be funny. (I would accept any attempt even if it’s wrong. Something like “one post says something won’t happen, while right next to it, there is another post reporting it has happened” would get partial credit. Full credit would have to acknowledge a few of the incongruities: People worried about “runaway agi” are concerned about catastrophic risk, but it is intentionally misunderstood to mean eloping with the AGI as a romantic partner, rendering the reassurance worthless; the claim that it has already happened likewise appears to be misunderstanding as well since it is reported in the manner of a pet running away, a somewhat different scale of tragedy and incongruous because we don’t think of AGI as a pet.)

Overall not super impressed with the quality so far. Describing images is impressive for sure, but subtleties, humour, and being able to draw on common wisdom are some of the main appeals of “multi-modal GPT“ and this is lacking that spirit.







