AlphaGo and AlphaZero were able to achieve superhuman performance due to the availability of perfect simulators for the game of Go. There is no such simulator for the real world we live in (although pure LLMs sort of learn a rough, abstract representation of the world as perceived by humans.) Sora is an attempt to build such a simulator using deep learning.
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.)
Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
Really interesting how this goes against my intuition. I would have imagined that it's infinitely easier to analyze a camera stream of the real world, then generate a polygonal representation of what you see (like you would do for a videogame) and then make AI decisions for that geometry. Instead the way that AI is going they rather skip it all and work directly on pixel data. Understanding of 3d geometry, perspective and physics is expected to evolve naturally from the training data.
> then generate a polygonal representation of what you see
It's really not that surprising since, to be honest, meshes suck.
They're pretty general graphs but to actually work nicely they have to have really specific topological characteristics. Half of the work you do with meshes is repeatedly coaxing them back into a sane topology after editing them.
There is a perfect simulator of the real world available. It can be recorded with a camera! Once the researchers have a bit of time to get their bearings and figure out how to train an order of magnitude faster we'll get there.
Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.