Super cool, and really nice to see the continuous rapid progress of these models! I have to wonder how long-term state (building a base and coming back later) as well as potentially guided state (e.g. game rules that are enforced in traditional code, or multiplayer, or loading saved games, etc) will work.
It's probably not by just extending the context window or making the model larger, though that will of course help, because fundamentally external state and memory/simulation are two different things (right?).
Either way it seems natural that these models will soon be used for goal-oriented imagination of a task – e.g. imagine a computer agent that needs to find a particular image on a computer, it would continuously imagine the path between what it currently sees and its desired state, and unlike this model which takes user input, it would imagine that too. In some ways, to the best of my understanding, this already happens with some robot control networks, except without pixels.
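To make that "goal-oriented imagination" idea concrete, here's a hypothetical sketch of what such a loop could look like — a random-shooting planner over imagined rollouts. Every name here (`world_model`, `encode`, `step`) is invented for illustration, not any real API:

```python
import random
import numpy as np

def plan_by_imagination(world_model, encode, current_pixels, goal_pixels,
                        candidate_actions, horizon=8, n_rollouts=64):
    """Imagine many action sequences; return the first action of the rollout
    that ends closest to the desired state."""
    goal_z = encode(goal_pixels)               # desired state, in latent space
    best_first_action, best_dist = None, float("inf")
    for _ in range(n_rollouts):
        z = encode(current_pixels)
        actions = [random.choice(candidate_actions) for _ in range(horizon)]
        for a in actions:
            z = world_model.step(z, a)         # "imagine" the next state, no real input
        dist = float(np.linalg.norm(z - goal_z))
        if dist < best_dist:
            best_first_action, best_dist = actions[0], dist
    return best_first_action                   # execute this, then re-plan
```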
There's not even the slightest hint of state in this demo: if you hold "turn left" for a full rotation you don't end up where you started. After a few rotations the details disappear and you're left in the middle of a blank ocean. There's no way this tech will ever make a playable version of Mario, let alone Minecraft.
There's plenty of evidence of state, just a very short-term memory. Examples:
- The inventory bar stays mostly consistent throughout play
- State transitions in response to key presses
- Block breakage over time is mostly consistent
- Toggling doors / hatches works as expected
- Jumping progresses with correct physics
Turning around and seeing approximately the same thing you saw a minute ago is probably just a matter of extending the context window, but that inherently has limits at the scale of an entire world, even if we somehow make context windows compress redundant data extremely well (which would greatly help LLM transformers too). What I'm mostly wondering about is how you would synchronize this state with a ground truth so that it can be shared between different instances of the agent, or with other non-agent entities.
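One naive shape the synchronization could take — periodically re-anchoring the model's context on an authoritative shared state, the way multiplayer games reconcile client-side prediction with the server. A toy sketch, with every name (`server`, `next_frame`, etc.) hypothetical:

```python
def step_with_sync(model, context, action, frame_idx, server, sync_every=30):
    """Imagine frames freely, but re-sync to shared ground truth periodically."""
    if frame_idx % sync_every == 0:
        # Discard the drifted imagined state; re-encode the authoritative
        # world state that all agent instances share.
        context = model.encode(server.get_authoritative_state())
    frame, context = model.next_frame(context, action)  # imagined step between syncs
    return frame, context
```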
And again, I think it's important to remember that games are just a great training ground for this type of technology; it's probably more useful in non-game fields such as computer automation, robot control, etc.
Hey, developer of Oasis here! You are very correct. Here are a few points:
1. We trained the model with a context window as long as 30 seconds. The problem? It barely pays any attention to frames beyond the most recent few. This makes sense given the model's loss function during training. We are now running many different training runs to experiment with a better loss function (and datamix) to solve this issue. You'll see newer versions soon!
2. In the long term, we believe the "ultimate" solution is 2 models: 1 model that maintains game state + 1 model that turns that state into pixels. Think of it as the first model being something closer to an LLM that takes the current state + user action and produces the new state, and the second model being a diffusion model that maps that state to pixels. This would get the best of both worlds.
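Roughly, the loop would look like this (purely illustrative pseudocode, not our actual architecture — all class and method names are invented):

```python
class StateModel:
    """LLM-like: (state tokens, user action) -> new state tokens."""
    def next_state(self, state_tokens, action): ...

class PixelModel:
    """Diffusion-like: state tokens -> one rendered frame."""
    def render(self, state_tokens): ...

def game_loop(state_model, pixel_model, state_tokens, get_user_action):
    while True:
        action = get_user_action()
        state_tokens = state_model.next_state(state_tokens, action)  # world state persists here
        yield pixel_model.render(state_tokens)                       # pixels are derived, never stored
```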
This stuff is all fascinating to me from a computer vision perspective. I'm curious - if you have a second model tasked with learning just the game state - does that mean you would be using info from the game itself (say, via a mod or with the developer console) as training data? Or is the idea that the model somehow learns the state (and only the state) on its own as it does here?
That's a great question -- lots of experiments will be going into the future versions of Oasis. There are quite a few different possibilities here and we'll have to experiment with them a lot.
The nice thing is that we can run tons of experiments at once. For Oasis v1, we ran over 1000 experiments (end-to-end training of a 500M model) on the model arch, datamix, etc., before we created the final checkpoint that's deployed on the site. At Decart (we just came out of stealth yesterday: https://www.theinformation.com/articles/why-sequoias-shaun-m...) we have 2 teams: Decart Infrastructure and Decart Experiences. The first team builds insanely fast infra for training/inference (rewriting everything from scratch, from CUDA kernels to the Python garbage collector) -- it lets us get a 500M model to converge during training in ~20h instead of 1-2 weeks. Then, Decart Experiences uses this infra to create these new types of end-to-end "Generated Experiences".
Nah, it doesn't even track which direction you're looking. Looking straight ahead, walk into some sugar cane so your whole screen is green. Now look up. It thinks you were looking down.
I guess it comes down to your definition of state. I'm not saying there's enough state for this to be playable, but there is clearly state, and I think it's worth pointing out how impressive this model's temporal consistency and coherence are, considering that not long ago the state of the art here rapidly decohered into completely noisy pixels.
I guess if you consider knowing what color the pixels were in the last frame "state". That's not a definition anyone would use, though. Spinning around and having the world continuously regenerate, or looking at the sky and back down and having the scene regenerate randomly, is the opposite of state. It's complete incoherence.
Just a thought - complete incoherence is a noise function, no? Successive frames here are far more correlated than that, which is pretty remarkable.
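A quick numpy illustration of the difference (the arrays are stand-ins, not actual Oasis frames):

```python
import numpy as np

def frame_corr(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(0)
noise_a, noise_b = rng.random((64, 64)), rng.random((64, 64))
frame = rng.random((64, 64))
next_frame = frame + 0.05 * rng.standard_normal((64, 64))  # mostly carried over

print(frame_corr(noise_a, noise_b))   # ~0.0: pure noise shares no information
print(frame_corr(frame, next_frame))  # ~1.0: successive frames share almost everything
```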
My definition of state is something like reified bits of information, for which previous frames and such certainly count (knowing the current frame tells you a lot of information about the next frame vs not knowing the current frame).
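In information-theoretic terms (my gloss): state is whatever makes the mutual information between successive frames positive,

```latex
I(X_t;\, X_{t+1}) \;=\; H(X_{t+1}) - H(X_{t+1} \mid X_t) \;>\; 0
```

i.e. conditioning on the current frame strictly reduces your uncertainty about the next one, whereas pure noise would give exactly zero.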
Yeah, probably. It remains to be seen whether these models can actually help guide a live session toward a goal. At least it's been shown that these types of world models can help an agent get better at achieving a goal, in lieu of a hard-coded simulation environment or the real world, when those options aren't tractable.
My favorite example is: https://worldmodels.github.io/
(Not least of all because they actually simulate these simplified world models in your browser!)
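For anyone who hasn't read it: that paper splits the agent into V (a VAE that compresses pixels to a latent), M (an MDN-RNN that predicts the next latent), and C (a tiny controller), and C can be trained entirely inside M's "dream". A from-memory sketch, with method names invented:

```python
def rollout_in_dream(V, M, C, first_obs, horizon=1000):
    """Evaluate controller C entirely inside the learned model M (no real env)."""
    z, h = V.encode(first_obs), M.initial_hidden()
    total_reward = 0.0
    for _ in range(horizon):
        a = C.act(z, h)                        # controller sees latent + RNN hidden state
        z, h, r, done = M.dream_step(z, h, a)  # M samples the next latent, reward, done flag
        total_reward += r
        if done:
            break
    return total_reward  # fitness signal for training C (the paper uses CMA-ES)
```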
>There's no way this tech will ever make a playable version of Mario
Wait a few months: if someone is willing to use their 4090 to train the model, the technology is already here. If you can play a level of Doom this way, then Mario should be even easier.