Something seems fishy here. Take the example with the guy next to the robot figure: their model just happened to predict exactly the same type of figure?! Diffusion models are not omnipotent…
That's the entire point. It didn't "happen" to predict exactly the same type of figure. It used the context photos to know what type of figure it should render.
You might be getting confused because here the training process has to happen every time you use it, whereas in most AI applications the training is done once up front and actual use is inference only.
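To make that distinction concrete, here's a minimal toy sketch in PyTorch. Everything in it (ToyDenoiser, the loss, the step counts) is an illustrative assumption, not the actual system being discussed; it only shows the shape of the two workflows.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained diffusion model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)
        )

    def forward(self, noisy):
        return self.net(noisy)

def inference_only(model, noise):
    # Most AI apps: weights are frozen, use is just a forward pass
    # (or a sampling loop) through the pretrained model.
    with torch.no_grad():
        return model(noise)

def per_use_training(model, context_photos, noise, steps=50, lr=1e-4):
    # What's happening in the demo: each use first fine-tunes the model
    # on the user's context photos, so by sampling time it has already
    # "seen" that exact figure.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        eps = torch.randn_like(context_photos)
        noisy = context_photos + eps
        pred = model(noisy)
        loss = ((pred - eps) ** 2).mean()  # toy denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(noise)

# Same pretrained weights, two very different workflows.
model = ToyDenoiser()
context_photos = torch.randn(8, 64)  # stand-in for the user's photos
out_generic = inference_only(model, torch.randn(1, 64))
out_tailored = per_use_training(model, context_photos, torch.randn(1, 64))
```

The point of the contrast: in `inference_only` the weights never change, while `per_use_training` spends an optimization loop on the context photos before sampling, which is why the output can match that exact figure instead of a generic one.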