Something seems fishy here. Take the example with the guy next to the robot figure: their model just happened to predict exactly the same type of figure?! Diffusion models are not omnipotent…
That's the entire point. It didn't "happen" to predict exactly the same type of figure. It used the context photos to know what type of figure it should render.
You might be getting confused because here the training process has to happen every time you use it, whereas in most AI applications the training is done once up front and actual use is inference only.
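To make that distinction concrete, here's a minimal toy sketch in PyTorch. Everything in it (ToyDenoiser, the loss, the step counts) is an illustrative assumption, not the actual system being discussed; it only shows the shape of the two workflows.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained diffusion model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)
        )

    def forward(self, noisy):
        return self.net(noisy)

def inference_only(model, noise):
    # Most AI apps: weights are frozen, use is just a forward pass
    # (or a sampling loop) through the pretrained model.
    with torch.no_grad():
        return model(noise)

def per_use_training(model, context_photos, noise, steps=50, lr=1e-4):
    # What's happening in the demo: each use first fine-tunes the model
    # on the user's context photos, so by sampling time it has already
    # "seen" that exact figure.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        eps = torch.randn_like(context_photos)
        noisy = context_photos + eps
        pred = model(noisy)
        loss = ((pred - eps) ** 2).mean()  # toy denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(noise)

# Same pretrained weights, two very different workflows.
model = ToyDenoiser()
context_photos = torch.randn(8, 64)  # stand-in for the user's photos
out_generic = inference_only(model, torch.randn(1, 64))
out_tailored = per_use_training(model, context_photos, torch.randn(1, 64))
```

The point of the contrast: in `inference_only` the weights never change, while `per_use_training` spends an optimization loop on the context photos before sampling, which is why the output can match that exact figure instead of a generic one.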