
> It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

That'd actually be interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world-model capabilities, which would be even more interesting.)

Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
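One concrete way to poke at the first question is a probing classifier, roughly in the spirit of the Othello-GPT work: feed the model a textual game transcript, pull out hidden states, and train a small linear probe to recover the board state. If the probe reads the board off accurately, the model arguably has some spatial representation; if it can't, that's evidence for the "struggles" claim. Below is a minimal sketch, not anyone's actual setup; the model name, layer choice, board encoding (64 squares x 3 states), and the `transcripts`/`boards` data are all hypothetical stand-ins.

  import torch
  from torch import nn
  from transformers import AutoModel, AutoTokenizer

  MODEL_NAME = "gpt2"  # assumption: any causal LM exposing hidden states would do

  tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
  model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
  model.eval()

  def hidden_at_last_token(transcript: str, layer: int = 6) -> torch.Tensor:
      """Hidden state of the final token at a chosen layer, for one transcript."""
      inputs = tokenizer(transcript, return_tensors="pt")
      with torch.no_grad():
          out = model(**inputs)
      return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

  class BoardProbe(nn.Module):
      """Linear probe: predict each square's state from a single hidden vector."""
      def __init__(self, hidden_dim: int, n_squares: int = 64, n_states: int = 3):
          super().__init__()
          self.linear = nn.Linear(hidden_dim, n_squares * n_states)
          self.n_squares, self.n_states = n_squares, n_states

      def forward(self, h: torch.Tensor) -> torch.Tensor:
          return self.linear(h).view(-1, self.n_squares, self.n_states)

  def train_probe(transcripts, boards, layer=6, epochs=5, lr=1e-3):
      # `transcripts`: textual move sequences; `boards`: LongTensors of shape (64,)
      # giving the true state of each square after the sequence (hypothetical data).
      probe = BoardProbe(model.config.hidden_size)
      opt = torch.optim.Adam(probe.parameters(), lr=lr)
      loss_fn = nn.CrossEntropyLoss()
      for _ in range(epochs):
          for text, board in zip(transcripts, boards):
              h = hidden_at_last_token(text, layer)
              logits = probe(h.unsqueeze(0))              # (1, 64, 3)
              loss = loss_fn(logits.view(-1, 3), board.view(-1))
              opt.zero_grad(); loss.backward(); opt.step()
      return probe  # high held-out probe accuracy ~ evidence of a spatial world model

The interesting comparison would then be running the same probe on a model fed the same games through a visual modality and seeing whether the representations differ, which is basically the second question above.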

Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]


