Open Flamingo – open framework to train multimodal LLMs (laion.ai)
265 points by mpaepper on March 28, 2023 | 25 comments

In the demo I put the Obama prank photo http://karpathy.github.io/2012/10/22/state-of-computer-visio... and asked "Why is this picture funny?" and it responded "Question: Why is this picture funny? Answer: President Obama is taller than the average person."

Tbh I'm not sure why the pic is funny. And I am human.

Bot detected

Is it something to do with the idiom of putting a thumb on the scale, meaning using your power to influence the outcome of something?

He's stepping on the scale, making the guy seem heavier than he is.

Furthermore, the man on the scale is facing the other way and wouldn't know someone is stepping on the scale. There's an element of theory of mind there: you have to understand that the man on the scale is unaware of Obama's action.

The article points this out, along with several other things we all instantly recognize.

Thanks. I was looking at the picture on my phone and it was difficult to see what was going on in the picture.

Luckily the article explains it right below the caption ;-)

But you understand why people might find it funny, and what the people in the image are thinking; you have a mental model of them. That's the point.

That’s because it’s very posed so it’s not that funny.

Has this one fared better with other image prompt models? It’s a great little challenge, I’m curious for a follow up!

What's the GPT-4 answer to why that picture is funny?

> @karpathy: We tried and it solves it :O. The vision capability is very strong but I still didn't believe it could be true. The waters are muddied some by a fear that my original post (or derivative work thereof) is part of the training set. More on it later.


Is there a "More on it"?

I always like to try these zero-shot models on things outside of the "normal" COCO classes. Here are some chess board queries:

Counting: https://imgur.com/KTuQ1Bv

Parse the chess board: https://imgur.com/2zYFK1P

(Result): https://imgur.com/Ei4MAl7

Few-Shot Object Detection (Pascal VOC): https://imgur.com/gZkDMn8

Few-Shot Object Detection (simplified): https://imgur.com/Hk8QGMd

Not quite there yet. I've been more impressed with the other new zero-shot multimodal models like Grounding DINO and Azure Dense Captioning. Really looking forward to putting multimodal GPT-4 through its paces as well.

> Parse the chess board:

Could it be that the actual issue has to do with it having trouble with small tokens (letters, numbers)?

Does it give a different result if you ask it to answer in a format like this?

> Please name what kind of piece is on each square of this board
> A1: white rook
> A2: white pawn
> A3: empty
> A4: empty
> ...

Prompting can be so unintuitive sometimes. Maybe it just has an issue with the output representation or something...
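One way to check that hypothesis is to pin down the output format and parse the reply back mechanically. A minimal sketch (the prompt wording and the `parse_answer` helper are my own, hypothetical, not anything the model is known to follow):

```python
# Hypothetical sketch: ask for one "square: piece" pair per line, then parse
# the model's free-text answer back into a board dict.

def build_prompt() -> str:
    """Prompt that spells out the first few squares so the model can
    continue the pattern."""
    return "\n".join([
        "Please name what kind of piece is on each square of this board.",
        "A1:", "A2:", "...",
    ])

def parse_answer(answer: str) -> dict:
    """Parse lines like 'A1: white rook' or 'A3: empty' into a dict,
    ignoring anything that doesn't look like a square."""
    board = {}
    for line in answer.splitlines():
        if ":" not in line:
            continue
        square, piece = line.split(":", 1)
        square, piece = square.strip().upper(), piece.strip().lower()
        if len(square) == 2 and square[0] in "ABCDEFGH" and square[1] in "12345678":
            board[square] = piece
    return board

print(parse_answer("A1: white rook\nA2: white pawn\nA3: empty"))
# {'A1': 'white rook', 'A2': 'white pawn', 'A3': 'empty'}
```

If the model still fails with the format fixed, the problem is more likely perception than output representation.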

Even at this scale the model's able to answer questions fairly impressively, but I created an image with some distinct shapes in different positions and it didn't go well [0]. I think that however they're doing the image encoding, it doesn't capture positional information, which, to my mind, limits a lot of use cases.

[0] https://i.postimg.cc/GtrGs8mw/Screenshot-2023-03-28-at-5-19-...

It's not the image embedding; it's the objective task. Image-to-text is simply not good enough: it's really lossy, and the datasets are garbage, so it's not very robust.
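For reference, ViT-style vision encoders do inject position by adding a position embedding to each image-patch embedding before attention; whether that signal survives the vision-to-language bridge is a separate question. A toy NumPy sketch (all shapes and numbers are illustrative, not OpenFlamingo's actual pipeline):

```python
import numpy as np

def sinusoidal_positions(n_patches: int, dim: int) -> np.ndarray:
    """Standard transformer sinusoidal position embeddings, one row per patch."""
    pos = np.arange(n_patches)[:, None]          # (n_patches, 1)
    i = np.arange(dim)[None, :]                  # (1, dim)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

patches = np.random.randn(49, 64)                # toy 7x7 grid of patch embeddings
with_pos = patches + sinusoidal_positions(49, 64)
print(with_pos.shape)                            # (49, 64)
# If a later step pools (e.g. averages) over all 49 patches, the added
# position signal is largely washed out -- one plausible way "where"
# information dies before it reaches the language model.
```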

This is awesome work, and they also provide their 9B OpenFlamingo model, which is based on LLaMA:


What are the key features of Open Flamingo, and how does it compare to other frameworks for training multimodal LLMs?

What’re the techniques that’ll get this to run on a single GPU?

Most of the parameters are in the language model (LLaMA-7B). So they'd be pretty much the same techniques that would let LLaMA run on a single GPU -- especially lower-precision tricks. If you only want to run inference/forward (no training), it should be pretty doable.
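The rough arithmetic behind those lower-precision tricks, for the weights alone (a back-of-the-envelope sketch; 7B is approximate, and activations, KV cache, and CUDA overhead come on top):

```python
# Weight memory = number of parameters * bits per parameter / 8 bytes, in GiB.

def weight_gib(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 2**30

n = 7e9  # approximate parameter count for a LLaMA-7B-class model
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {weight_gib(n, bits):5.1f} GiB")
# 32-bit ~ 26.1 GiB, 16-bit ~ 13.0 GiB, 8-bit ~ 6.5 GiB, 4-bit ~ 3.3 GiB
```

So fp16 already fits on a 16 GB card for inference, and 8-bit or 4-bit quantization brings it within reach of common consumer GPUs.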

You can almost definitely run it on a consumer GPU if you swap out the language model for something smaller as well (although the performance on the language side would definitely not be as good).

That title is pretty impressive/big on mobile!

It's just bad unresponsive CSS.

It's so ugly to have the words break anywhere and for the intentional line breaks to still occur anyway.

All they needed to do was use media queries for at least three screen widths and adjust the font size in there accordingly.
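Something like this would do it (the selector, breakpoint widths, and sizes are all made up, just to illustrate the shape of the fix):

```css
/* Illustrative only: three hypothetical breakpoints scaling the title font. */
h1.title { font-size: 3rem; overflow-wrap: normal; }

@media (max-width: 900px) { h1.title { font-size: 2.25rem; } }
@media (max-width: 600px) { h1.title { font-size: 1.5rem; } }
@media (max-width: 400px) { h1.title { font-size: 1.1rem; } }
```

That also avoids needing `word-break: break-all`-style rules, which is what makes words break anywhere.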
