This is such a natural extension to LLMs. I’m shocked it hasn’t been tried before.
When I ask a diffusion model to generate a chessboard, I’d expect the pieces to be placed randomly. We are getting closer to image generators that not only know what chess pieces look like but also where to place them.
Stupid question: is their 7B model available? Is there public inference code that we could run? Or do they not usually release them along with these kinds of papers?
There don't appear to be any weights uploaded anywhere that I can find.
There are the starts of two (non-original-author) public implementations on GitHub, but again -- neither appears to include pretrained weights.
This is somewhat similar, but diffusion transformers typically use a pre-trained text model for the text conditioning, whereas in this case the text model is integrated and trained together multimodally.
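To make the contrast concrete, here is a minimal toy sketch (all names and shapes are hypothetical, not from the paper): in the usual diffusion-transformer setup, a frozen pre-trained text encoder produces conditioning vectors consumed by a separately trained image model, whereas in an integrated multimodal setup, text and image tokens share one sequence processed by one jointly trained set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy embedding size (hypothetical)

# Style A: typical diffusion transformer -- a *frozen* pre-trained text
# encoder (stand-in for e.g. CLIP/T5) produces conditioning vectors;
# its weights never update during diffusion training.
TABLE = rng.standard_normal((1000, D))
def frozen_text_encoder(token_ids):
    return TABLE[token_ids]

# Style B: integrated multimodal training -- text and image tokens are
# concatenated into one sequence and fed through one transformer whose
# weights are trained on both modalities together.
def joint_sequence(text_emb, image_emb):
    return np.concatenate([text_emb, image_emb], axis=0)

text = frozen_text_encoder(np.array([3, 14, 159]))   # 3 text tokens
image = rng.standard_normal((4, D))                  # 4 image patches

seq = joint_sequence(text, image)
print(seq.shape)  # (7, 16): one shared stream of text + image tokens
```

The practical difference is what gets gradients: in Style A only the image branch learns, so text understanding is capped by the frozen encoder; in Style B the text representation itself adapts to the generation task.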