What if, instead of a video game, this were trained on video and control inputs from people operating equipment like warehouse robots? An automated system could then visualize the outcome of a proposed action, or series of actions, before executing it on the real equipment. You would still need a separate model or algorithm to propose control inputs, but the world model would give the system a way to validate and refine plans as part of a problem-solving feedback loop (sketched below).
> Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control
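For concreteness, here is one way that validate-and-refine loop could look: random-shooting model-predictive control wrapped around a learned video model. This is a minimal sketch, not RT-2's actual method; `WorldModel`, `score_outcome`, `plan`, and the frame/action shapes are all hypothetical placeholders.

```python
import numpy as np

class WorldModel:
    """Stand-in for a video-prediction model trained on (video, control) pairs."""
    def predict(self, frames: np.ndarray, actions: np.ndarray) -> np.ndarray:
        # A real model would roll the video forward conditioned on the actions;
        # this placeholder just repeats the last observed frame per action step.
        return np.repeat(frames[-1:], len(actions), axis=0)

def score_outcome(predicted_frames: np.ndarray, goal_frame: np.ndarray) -> float:
    # Toy objective: negative pixel distance between the final predicted
    # frame and an image of the desired end state.
    return -float(np.mean((predicted_frames[-1] - goal_frame) ** 2))

def plan(model, frames, goal_frame, horizon=8, n_candidates=64, action_dim=4):
    """Random-shooting MPC: sample action sequences, simulate each with the
    world model, and keep the sequence whose predicted outcome scores best."""
    best_actions, best_score = None, -np.inf
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        predicted = model.predict(frames, actions)
        s = score_outcome(predicted, goal_frame)
        if s > best_score:
            best_actions, best_score = actions, s
    return best_actions  # execute the first action, observe, then replan

if __name__ == "__main__":
    model = WorldModel()
    frames = np.zeros((4, 64, 64, 3))   # recent camera frames (hypothetical shape)
    goal = np.ones((64, 64, 3))         # image of the desired end state
    actions = plan(model, frames, goal)
    print("first planned action:", actions[0])
```

Executing only the first action and then replanning against fresh observations is what makes this a feedback loop rather than open-loop playback; the proposal step here is plain random sampling, which a stronger policy model could replace.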