Super cool, and really nice to see the continuous rapid progress of these models! I have to wonder how long-term state (building a base and coming back later) as well as potentially guided state (e.g. game rules that are enforced in traditional code, or multiplayer, or loading saved games, etc) will work.
It's probably not by just extending the context window or making the model larger, though that will of course help, because fundamentally external state and memory/simulation are two different things (right?).
Either way it seems natural that these models will soon be used for goal-oriented imagination of a task – e.g. imagine a computer agent that needs to find a particular image on a computer, it would continuously imagine the path between what it currently sees and its desired state, and unlike this model which takes user input, it would imagine that too. In some ways, to the best of my understanding, this already happens with some robot control networks, except without pixels.
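To make that "goal-oriented imagination" idea concrete, here's a hypothetical sketch of what such a loop could look like — a random-shooting planner over imagined rollouts. Every name here (`world_model`, `encode`, `step`) is invented for illustration, not any real API:

```python
import random
import numpy as np

def plan_by_imagination(world_model, encode, current_pixels, goal_pixels,
                        candidate_actions, horizon=8, n_rollouts=64):
    """Imagine many action sequences; return the first action of the rollout
    that ends closest to the desired state."""
    goal_z = encode(goal_pixels)               # desired state, in latent space
    best_first_action, best_dist = None, float("inf")
    for _ in range(n_rollouts):
        z = encode(current_pixels)
        actions = [random.choice(candidate_actions) for _ in range(horizon)]
        for a in actions:
            z = world_model.step(z, a)         # "imagine" the next state, no real input
        dist = float(np.linalg.norm(z - goal_z))
        if dist < best_dist:
            best_first_action, best_dist = actions[0], dist
    return best_first_action                   # execute this, then re-plan
```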
There's not even the slightest hint of state in this demo: if you hold "turn left" for a full rotation you don't end up where you started. After a few rotations the details disappear and you're left in the middle of a blank ocean. There's no way this tech will ever make a playable version of Mario, let alone Minecraft.
There's plenty of evidence of state, just a very short-term memory. Examples:
- The inventory bar stays mostly consistent throughout play
- State transitions in response to key presses
- Block breakage over time is mostly consistent
- Toggling doors / hatches works as expected
- Jumping progresses with correct physics
Turning around and seeing approximately the same thing you saw a minute ago is probably just a matter of extending the context window, but that inherently has limits at the scale of an entire world, even if we somehow make context windows compress redundant data extremely well (which would greatly help LLM transformers too). What I'm mostly wondering about is how you would synchronize this state with a ground truth so that it can be shared between different instances of the agent, or with other non-agent entities.
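One naive shape the synchronization could take — periodically re-anchoring the model's context on an authoritative shared state, the way multiplayer games reconcile client-side prediction with the server. A toy sketch, with every name (`server`, `next_frame`, etc.) hypothetical:

```python
def step_with_sync(model, context, action, frame_idx, server, sync_every=30):
    """Imagine frames freely, but re-sync to shared ground truth periodically."""
    if frame_idx % sync_every == 0:
        # Discard the drifted imagined state; re-encode the authoritative
        # world state that all agent instances share.
        context = model.encode(server.get_authoritative_state())
    frame, context = model.next_frame(context, action)  # imagined step between syncs
    return frame, context
```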
And again, I think it's important to remember that games are just a great training ground for this type of technology; it's probably more useful in non-game fields such as computer automation, robot control, etc.
Hey, developer of Oasis here! You are very correct. Here are a few points:
1. We trained the model with a context window as long as 30 seconds. The problem? It barely pays any attention to frames beyond the most recent few. This makes sense given the model's loss function during training. We are now running many different training runs to experiment with a better loss function (and datamix) to solve this issue. You'll see newer versions soon!
2. In the long term, we believe the "ultimate" solution is 2 models: 1 model that maintains game state + 1 model that turns that state into pixels. Think of it as the first model being something closer to an LLM that takes the current state + user action and produces the new state, and the second model being a diffusion model that maps that state to pixels. This would get the best of both worlds.
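Roughly, the loop would look like this (purely illustrative pseudocode, not our actual architecture — all class and method names are invented):

```python
class StateModel:
    """LLM-like: (state tokens, user action) -> new state tokens."""
    def next_state(self, state_tokens, action): ...

class PixelModel:
    """Diffusion-like: state tokens -> one rendered frame."""
    def render(self, state_tokens): ...

def game_loop(state_model, pixel_model, state_tokens, get_user_action):
    while True:
        action = get_user_action()
        state_tokens = state_model.next_state(state_tokens, action)  # world state persists here
        yield pixel_model.render(state_tokens)                       # pixels are derived, never stored
```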
This stuff is all fascinating to me from a computer vision perspective. I'm curious - if you have a second model tasked with learning just the game state - does that mean you would be using info from the game itself (say, via a mod or with the developer console) as training data? Or is the idea that the model somehow learns the state (and only the state) on its own as it does here?
That's a great question -- lots of experiments will be going into the future versions of Oasis. There are quite a few different possibilities here and we'll have to experiment with them a lot.
The nice thing is that we can run tons of experiments at once. For Oasis v1, we ran over 1000 experiments (end-to-end training of a 500M model) on the model arch, datamix, etc., before we created the final checkpoint that's deployed on the site. At Decart (we just came out of stealth yesterday: https://www.theinformation.com/articles/why-sequoias-shaun-m...) we have 2 teams: Decart Infrastructure and Decart Experiences. The first team builds insanely fast infra for training/inference (rewriting everything from scratch, from CUDA kernels to the Python garbage collector) -- it lets us get a 500M model to converge during training in ~20h instead of 1-2 weeks. Then, Decart Experiences uses this infra to create these new types of end-to-end "Generated Experiences".
Nah, it doesn't even track which direction you're looking. Looking straight ahead, walk into some sugar cane so your whole screen is green. Now look up. It thinks you were looking down.
I guess it comes down to your definition of state. I'm not saying there's enough state for this to be playable, but there is clearly state, and I think it's worth pointing out how impressive this model's temporal consistency and coherence are, considering that not long ago the state of the art here rapidly decohered into completely noisy pixels.
I guess if you consider knowing what color the pixels were in the last frame "state". That's not a definition anyone would use, though. Spinning around and having the world continuously regenerate, or looking at the sky and back down and having the scene regenerate randomly, is the opposite of state. It's complete incoherence.
Just a thought - complete incoherence is a noise function, no? Successive frames here are far more correlated than that, which is pretty remarkable.
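A quick numpy illustration of the difference (the arrays are stand-ins, not actual Oasis frames):

```python
import numpy as np

def frame_corr(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(0)
noise_a, noise_b = rng.random((64, 64)), rng.random((64, 64))
frame = rng.random((64, 64))
next_frame = frame + 0.05 * rng.standard_normal((64, 64))  # mostly carried over

print(frame_corr(noise_a, noise_b))   # ~0.0: pure noise shares no information
print(frame_corr(frame, next_frame))  # ~1.0: successive frames share almost everything
```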
My definition of state is something like reified bits of information, for which previous frames and such certainly count (knowing the current frame tells you a lot of information about the next frame vs not knowing the current frame).
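In information-theoretic terms (my gloss): state is whatever makes the mutual information between successive frames positive,

```latex
I(X_t;\, X_{t+1}) \;=\; H(X_{t+1}) - H(X_{t+1} \mid X_t) \;>\; 0
```

i.e. conditioning on the current frame strictly reduces your uncertainty about the next one, whereas pure noise would give exactly zero.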
Yeah, probably. It remains to be seen whether these models can actually help guide a live session toward a goal. At least it's been shown that these types of world models can help an agent get better at achieving a goal, in lieu of a hard-coded simulation environment or the real world, when those options aren't tractable.
My favorite example is: https://worldmodels.github.io/
(Not least of all because they actually simulate these simplified world models in your browser!)
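For anyone who hasn't read it: that paper splits the agent into V (a VAE that compresses pixels to a latent), M (an MDN-RNN that predicts the next latent), and C (a tiny controller), and C can be trained entirely inside M's "dream". A from-memory sketch, with method names invented:

```python
def rollout_in_dream(V, M, C, first_obs, horizon=1000):
    """Evaluate controller C entirely inside the learned model M (no real env)."""
    z, h = V.encode(first_obs), M.initial_hidden()
    total_reward = 0.0
    for _ in range(horizon):
        a = C.act(z, h)                        # controller sees latent + RNN hidden state
        z, h, r, done = M.dream_step(z, h, a)  # M samples the next latent, reward, done flag
        total_reward += r
        if done:
            break
    return total_reward  # fitness signal for training C (the paper uses CMA-ES)
```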
>There's no way this tech will ever make a playable version of Mario
Wait a few months: if someone is willing to use their 4090 to train the model, the technology is already here. If you can play a level of Doom this way, then Mario should be even easier.