Diffusion models are real-time game engines (gamengen.github.io)
1149 points by jmorgan 17 days ago | 409 comments



So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.


> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.


Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...


It reminds me of dreaming. When you do something and turn back to check, it has turned into something completely different.

edit: someone should train it on MyHouse.wad


Not noticing a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.


That's absolutely true. It's also well-established by Simons et al. and others that healthy normal adults maintain only a very sparse visual representation of their surroundings, anchored but not perfectly predicted by attention, and this drives the unattended gorilla phenomenon (along with many others). I don't work in this domain, but I would suggest that object permanence probably starts with attending and perceiving an object, whereas the inattentional or change blindness phenomena mostly (but not exclusively) occur when an object is not attended (or only briefly attended) or attention is divided by some competing task.


Are you saying if I turn around, I’ll be surprised at what I find? I don’t feel like this is accurate at all.


Not exactly, but our representation of what's behind us is a lot more sparse than we would assume. That is, I might not be surprised by what I see when I turn around, but it could have changed pretty radically since I last looked, and I might not notice. In fact, an observer might be quite surprised that I missed the change.

Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.


Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U

Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.

And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.


Yup, great example! Simons's lab has done some things along exactly these lines [0], too.

[0]: https://www.youtube.com/watch?v=wBoMjORwA-4


The opposite - if you turn around and there's something that wasn't there the last time - you'll likely not notice if it's not out of place. You'll just assume it was there and you weren't paying attention.

We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.


We also don't just work on images. We work on a lot of sensory data. So I think images of the environment are just one part of it.


If a generic human glances at an unfamiliar screen/wall/room, can they accurately, pixel-perfectly reconstruct every single element of it? Can they do it for every single screen they have seen in their entire lives?


I never said pixel perfect, but I would be surprised if whole objects, like flaming lanterns, suddenly appeared.

What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.

I bet if you look closely and objectively you will see even more anomalies. But at first watch, I didn’t see most errors because I think accepting something is more efficient for the brain.


You'd likely be surprised by a flaming lantern unless you were in Flaming Lanterns 'R Us, but if you were watching a video of a card trick and the two participants changed clothes while the camera wasn't focused on them, you may well miss that and the other five changes that came with that.


Work which exaggerates the blindness.

The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating them on a task which is intentionally completely different than the one explicitly given is going to bias people away from reporting gorillas.

And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.

Why would you think the gorillas were unusual?


I understand what you mean. I believe that the authors would contend that what you're describing is a typical attentional state for an awake/aware human: focused mostly on one thing, and with surprisingly little awareness of most other things (until/unless they are in turn attended).

Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.

[0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s


Does that work on autistic people? Having no filters, or fewer filters, should allow them to be more efficient "on guard duty", looking for unexpected things.


I saw a longer video of this that Ethan Mollick posted and in that one, the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.


Yes it definitely is very good for simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame, it has access to a whole sequence of prior frames.

But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.


> it's untenable for actual gameplay that the wall completely changed location and orientation.

It would in an SCP-themed game. Or dreamscape/Inception themed one.

Hell, "you're trapped in Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking though a specific, unique-looking doorway from the original game to be the victory condition - the player's job would be to coerce the model to generate it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.

(Implementation-wise, just loop in a simple recognition model to continously evaluate victory condiiton from last few frames, and some OCR to detect when player's hit points indicator on the HUD drops to zero.)

(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)
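A rough sketch of that harness idea, for the curious (the names gamengen_step, victory_model, read_hud_hp and get_input are hypothetical stand-ins, not anything from the paper):

    # Hypothetical harness for the "escape the dreamscape" pitch above.
    from collections import deque

    def play_escape_loop(gamengen_step, victory_model, read_hud_hp, get_input):
        frames = deque(maxlen=8)                   # short frame history the model conditions on
        while True:
            action = get_input()                   # player's key/mouse state this tick
            frame = gamengen_step(list(frames), action)
            frames.append(frame)
            if read_hud_hp(frame) <= 0:            # OCR on the HUD health counter
                return "dead"
            if victory_model(list(frames)) > 0.9:  # classifier spots the target doorway
                return "escaped"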


Check out the actual modern DOOM WAD MyHouse which implements these ideas. It totally breaks our preconceptions of what the DOOM engine is capable of.

https://en.wikipedia.org/wiki/MyHouse.wad


MyHouse is excellent, but it mostly breaks our perception of what the Doom engine is capable of by not really using the Doom engine. It leans heavily on engine features which were embellishments by the GZDoom project, and never existed in the original Doom codebase.


To be honest, I agree! That would be an interesting gameplay concept for sure.

Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.


Or what about training the model on many FPS games? Surviving in one nightmare that morphs into another, into another, into another ...


What you're pointing at mirrors the same kind of limitation in using LLMs for role-play/interactive fictions.


Maybe a hybrid approach would work. Certain things like inventory being stored as variables, lists etc.

Wouldn't be as pure though.


Give it state by having a rendered-but-offscreen pixel area that's fed back in as byte data for the next frame.


Huh.

Fun variant: give it hidden state by doing the offscreen scratch pixel buffer thing, but not grading its content in training. Train the model as before, grading on the "onscreen" output, and let it keep the side channel to do what it wants with. It'd be interesting to see what way it would use it, what data it would store, and how it would be encoded.
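Loosely, the training-side change might look like this (a sketch only, assuming a PyTorch-style setup; shapes and names are made up):

    import torch
    import torch.nn.functional as F

    VISIBLE_H, SCRATCH_H = 240, 16   # onscreen rows vs. hidden scratch rows

    def loss_on_visible_only(pred, target_visible):
        # pred: (B, C, VISIBLE_H + SCRATCH_H, W). Grade only the onscreen rows,
        # leaving the scratch rows unconstrained for the model to use as memory.
        return F.mse_loss(pred[:, :, :VISIBLE_H, :], target_visible)

    def next_context(prev_context, pred):
        # Feed the full frame, scratch rows included, back in as conditioning
        # for the next step (prev_context: (B, T, C, H, W)).
        return torch.cat([prev_context[:, 1:], pred.unsqueeze(1)], dim=1)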


It's an empirical question, right? But they didn't do it...


But does it need to be frame-based?

What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)


In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.


> DOOM ran great on an Intel 486

It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps in 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.

My memory could be wrong, but even if it required 66 MHz to reach 30 fps, that's still only 34 clocks per pixel on an architecture that required multiple clocks for a simple integer add instruction.
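For reference, the back-of-the-envelope arithmetic behind those figures (assuming 320x200 at 30 fps):

    pixels_per_second = 320 * 200 * 30         # 1,920,000
    print(33_000_000 / pixels_per_second)      # ~17.2 clocks per pixel at 33 MHz
    print(66_000_000 / pixels_per_second)      # ~34.4 clocks per pixel at 66 MHz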


An experience that isn’t asset- but rule-based.


In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.


What would the model provide if not what we see on the screen?


The environment and everything in it.

“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.

A modularized ecosystem i guess, comprised of “sub-systems” of sorts.

The other model, that provides all interaction (cause for effect) could either be run artificially or be used interactively by a human - opening up the possibility for being a tree : )

This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics and at the same time recording every state that has changed and diverged off the driving actor’s vector in time.

Basically the “effects” model keeping track of everyone’s history.

In the end a system with an “everything” model (that can grow over time), a “cause” model messing with it, brought together and documented by the “effect” model.

(Again … non technical person, just fantasizing) : )


For instance, for a generated real-world RPG, one process could create the planet, one could create the city where the player starts, one could create the NPCs, one could then model the relationships of the NPCs with each other. Each one building off of the other so that the whole thing feels nuanced and more real.

Repeat for quest lines, new cities, etc., with the NPCs having real-time dialogue and interactions that happen entirely off screen, no guarantee of there being a massive quest objective, and some sort of recorder of events that keeps a running tally of everything that goes on so that as the PCs interact with it they are never repeating the same dreary thing.

If this were an MMORPG it would require so much processing and architecting, but it would have the potential to be the greatest game in human history.


What you’re asking for doesn’t make sense.


So you're basically just talking about upgrading "enemy AI" to a more complex form of AI :)


That is kind of cool though; I would play it, like being lost in a dream.

If on the backend you could record the level layouts in memory, you could have exploration teams that try to find new areas to explore.


It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!


Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.


It is like some kind of weird dream Doom.


Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.


So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without outside reference, everything that falls out of short-term memory (few frames here) gets reimagined.


There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)


I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.


Where does a Sora video turn around backwards? I can’t maintain such consistency in my own dreams.


I don't know of an example (not to say it doesn't exist) but the problem is fundamentally the same as things moving out of sight/out of frame and coming back again.


> the problem is fundamentally the same as things moving out of sight/out of frame and coming back again

Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something that we will manage anytime soon, maybe 10-20 years with current architecture and same compute progress.

Then you make that even harder by remembering a whole game level? No, ain't gonna happen in our lifetimes without massive changes to the architecture. They would need to make a different model keep track of level state etc, not just an image to image model.


10 to 20 years sounds wildly pessimistic.

In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models.

https://www.youtube.com/watch?v=LXJ-yLiktDU

Buckle in, the train is moving way faster. I don't think there would be much surprise if this is solved in the next few generations of video generators. The first generation is already doing very well.


Did you watch the video? It is completely different after the dragon goes past. There's still a flag there, but everything else changed. Even the stores in the background changed, and the mass of people is completely different, with no hint of anyone moving there, etc.

You always get this from AI enthusiasts: they come and post "proof" that disproves their own point.


I'm not GP, but running over that video I'm actually having a hard time finding any detail that is present before the dragon obscures it which doesn't either exit frame right when the camera pans left slightly near the end, or re-appear with reasonably crisp detail after the dragon gets out of the way.

Most of the mob of people are indistinct, but there is a woman in a lime green coat who is visible, then obstructed by the dragon twice (beard and ribbon), and reappears fine. Unfortunately, when the dragon fully moves past, she has been lost to frame right.

There is another person in black holding a red satchel who is visible both before and after the dragon has passed.

Nothing about the storefronts appears to change. The complex sign full of Chinese text (which might be gibberish text: it's highly stylized and I don't know Chinese) appears to survive the dragon passing without even any changes to the individual ideograms.

There is also a red box shaped like a Chinese paper lantern with a single gold ideogram on it at the store entrance, which spends most of the video obscured by the dragon and is still in the same location after it passes (though video artifacting makes it more challenging to verify that the ideogram is unchanged, it certainly does not appear substantially different).

What detail are you seeing that is different before and after the obstruction?


> What detail are you seeing that is different before and after the obstruction?

First frame, there's a guy in a blue hat next to a flag. That flag and the guy are then gone afterwards.

The two flags near the wall are gone; there is something triangular there, but there were two flags before the dragon went past.

Then not to mention that the crowd is 6 people deep after the dragon went past, while it was just 4 people deep before; it is way more crowded.

Instead of the flag that was there before the dragon, it put in 2 more flags afterwards, far more to the left.

Around the third second a guy was out of frame for a few frames and suddenly gained a blue scarf. After the dragon went by he turned into a woman. Next to that person was a guy with a blue cap; he completely disappears.

> Most of the mob of people are indistinct

No they aren't, they are mostly distinct, and basically all of them change. If you ignore that the entire mob totally changes in number, appearance and position, sure, it is pretty good, except it forgot the flags. But how can you ignore the mob when we talk about the model remembering details? The wall is much less information-dense than the mob, so it is much easier for the model to remember; the difficulty is in the mob.

> but there is a woman in a lime green coat who is visible,

She was just out of frame for a fraction of a second, not the big bit where the dragon moves past. The guy in blue jacket and blue cap behind her disappears though, or merges with another person and becomes a woman with a muffler after the dragon moved past.

So, in the end some big strokes were kept, and that was a very tiny part of the image that was both there before and after the dragon moved past so it was far from a whole image with full details. Almost all details are wrong.

Maybe he meant that the house looked mostly the same. I agree the upper parts do, but I looked at the windows and they were completely different; they are full of people's heads after the dragon moved past, while before it was just clean walls.


We are looking at first generation tech and pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene. The prominent features are present. The model clearly shows the ability to go beyond "image-to-image" rendering.

If you want to be right because you can find some difference, sure, you win. But you also completely missed the point.


> pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene

Not in a game where those were enemies: it completely changed what they are and how many there are. People would notice such a massive change instantly if they looked away and suddenly there were 50% more enemies.

> The model clearly shows the ability to go beyond "image-to-image" rendering.

I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image, it is not harder to draw a straight pencil with something covering it than to draw the scene with something covering it for a while.

But still, even though it is that simple, these models are really bad at it, because it requires very large models and much compute. So I just extrapolated based on their current abilities that we know, as you demonstrated there, to say roughly how long until we can even have consistent short videos.

Note that videos won't have the same progression as images: the early image models were very small and we quickly scaled up there, while for video we start at really scaled-up models and have to wait until compute gets cheaper/faster the slow way.

> But also completely missed the point.

You completely missed my point or you changed your point afterwards. My point was that current models can only remember little bits under such circumstances, and to remember a whole scene they need to be massively larger. Almost all details in the scene you showed were missed, the large strokes are there but to keep the details around you need an exponentially larger model.


Where does a Sora video turn around backwards? I don’t even maintain such consistency in my dreams.


You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.


is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?


I think the real question is does the player get shot from behind?


great question

tangentially related but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (aka in front of the car)


Just want to clarify a couple possible misconceptions:

The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.

Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.

De-noising objectives are widespread in ML; intuitively, they force a predictive model to leverage context, i.e. surrounding frames/words/etc.

In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
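A minimal sketch of that noise-augmentation idea (hypothetical names, PyTorch-style; not the authors' code):

    import torch

    def corrupt_context(context_latents, max_level=0.7):
        # context_latents: (B, T, C, H, W) latents of the past frames.
        # Sample a per-example noise level, corrupt the history with Gaussian
        # noise, and return the level so the model can also be conditioned on it.
        b = context_latents.shape[0]
        level = torch.rand(b, 1, 1, 1, 1) * max_level
        noisy = context_latents + level * torch.randn_like(context_latents)
        return noisy, level.flatten()

Trained on corrupted history like this, the model's own slightly-off outputs at inference time still look "in distribution", so small errors stop compounding over long rollouts.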


The concept is that if you train a Diffusion model by feeding all the possible frames seen in the game.

The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.

Now you show it N frames as input, and ask it "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.

But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from the past dataset.

The drift you mentioned is actually clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.

It's a bit like if you train Stable Diffusion on Simpsons episodes, and it outputs the next frame of an existing episode that was in the training set, but a few frames later it goes wild and buggy.


Research is the acquisition of knowledge that may or may not have practical applications.

They succeeded in the research, gained knowledge, and might be able to do something awesome with it.

It’s a success even if they don’t sell anything.


I don't think you've understood the project completely. The model accepts player input, so frame 601 could be quite different if the player decided to turn left rather than right, or chose that moment to fire at an exploding barrel.


1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short term.

If it would be able to invent action and maps and let the user play "infinite doom", then it would be very different (and impressive!).


Like many people in the case of LLMs, you're just demonstrating unawareness of - or disbelief in - the fact that the model doesn't record training data verbatim, but smears it out in a high-dimensional space, from which it then samples. The model then doesn't recall past inputs (which are effectively under extreme lossy compression), but samples from that high-dimensional space to produce output. The high-dimensional representation by necessity captures semantic understanding of the training data.

Generating "infinite Doom" is exactly what this model is doing, as it does not capture the larger map layout well enough to stay consistent with it.


Whether or not a judge understands this will probably form the basis of any precedent set about the legality of image models and copyright.


I like "conditioned brute force" better term.


> 1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short term.

I mean... no? Not even close? Multiplying the number of game states by the number of inputs at any given frame gives you a number vastly bigger than 1 billion, not even comparable. Even with 20 days of play time to train on, it's entirely likely that at no point did someone stop at a certain location and look to the left from that angle. They might have done so from similar angles, but the model then has to reconstruct some sense of the geometry of the level to synthesize the frame. They might also not have arrived there from the same direction, which again the model needs some smarts to understand.

I get your point, it's very overtrained on these particular levels of Doom, which means you might as well just play Doom. But this is not a hash table lookup we're talking about, it's pretty impressive work.


This was the basis for the reasoning:

Map 1 has 2,518 walkable map units. There are 65,536 angles.

2,518 × 65,536 = 165,019,648

If you capture 165M frames, you already cover all the possibilities in terms of camera / player view, but probably the diffusion models don't even need to have all the frames (the same way that LLMs don't).


There's also enemy motion, enemy attacks, shooting, and UI considerations, which make the combinatorials explode.

And Doom movement isn't tile based. The map may be, but you can be in many many places on a tile.


Do you have to be exactly on a tile in Doom? I thought the guy walked smoothly around the map.


> I thought the guy walked smoothly around the map.

Correct. You are certainly not moving between the tiles as discrete units in doom.


I think enemies and effects are probably in there.


But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

I would call it the world's least efficient video compression.

What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?


> But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.

So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.


How is what you think they say not clear?

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.


No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.


It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.


It's a memory of a video looped to controls, so frame 1 is "I wonder how it would look if the player pressed D instead of W", then frame 2 is based on frame 1, etc., and a couple frames in, it's already not remembering but imagining the gameplay on the fly. It's not prerecorded, it responds to inputs during generation. That's what makes it a game engine.


They could down-convert the entire model to only utilize the subset of matrix components from Stable Diffusion. This approach may be able to improve internet bandwidth efficiency, assuming consumers in the future have powerful enough computers.


If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.


But it's trained on the actual screen pixel data, AFAICT. It's literally a visual imagination model, not a gameplay/geometry imagination model. They had to make special provisions for the pixel data on the HUD, which by its nature is different from the pictures of a 3D world.


> Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.

Even Google only has one Gemini (in a few versions).


If anybody, Google would know the most about that after their LLM memo all that time ago (basically "we're losing because we're trying to fight/compete with OS models"): https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...


Nicely summarised. Another important thing that clearly stands out (not to undermine the effort and work that have gone into this) is the fact that more and more we are now seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.

Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current-day web developer needs to know about network packets).

PS: Not great examples, but I hope you get the idea ;)


> nice reminder that open models are useful to

You didn't say open _what_ models. Was that intentional?


They did, SD 1.4


It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.

Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.


It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...


Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally - I have pretty severe aphantasia (I can conjure millisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone - but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn / built etc. However I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g.: can't get lost in a new city). So there clearly is a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.


I fabulate about this in another comment below:

> Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

(I obviously don't know what I'm talking about, just a fellow aphant)


Obviously we're all introspecting here - but my guess is that there's some kind of cross talk in aphantasic brains between the conscious narrating semantic brain and the visual module. Such that default mode visualisation is impaired. It's specifically the loss of reflexive consciousness that allows visuals to emerge. Not sure if this is related, but I have pretty severe chronic insomnia, and I often wonder if this in part relates to the inability to drift off into imagery.


Yeah. In my head it's like I'm manipulating SVG paths instead of raw pixels


Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.


What’s the aphantasia link? I’ve got aphantasia. I’m convinced though that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also I still love games like Pictionary and can somehow draw things onto paper that I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.


I agree, it is my own experience as well. Craig Venter, in one of his books, also credits this way of representing knowledge as abstractions as his strength in inventing new concepts.

The link may be that we actually see differences between “frames”, rather than the frames directly. That in itself would imply that a form of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.

Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

I’m nowhere near qualified to speak of this with certainty, but it seems plausible to me.


"As Richard Dawkins theorized" would be more accurate and less LLM-like :)


We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast: Counterintuitive Ideas About How the Brain Works [2], where she explains among other ideas how constant prediction is the most efficient way of running a brain as opposed to reaction. I never get tired of listening to her, she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s


Interesting talk about the brain, but the stuff she says about free will is not a very good argument. Basically it is sort of the argument that the ancient Greeks made, which brings the discussion to a point where you can take both directions.


> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

Yup, see https://en.wikipedia.org/wiki/Predictive_coding


Umm, that’s a theory.


So are gravity and friction. I don't know how well tested or accepted it is, but being just a theory doesn't tell you much about how true it is without more info


> It's insane that that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.

Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are like 2 orders of magnitude faster, indicating that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.


There's been a lot of work in optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.


> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.

It’s just the stochastic parrot argument again.


Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/


Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.


Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.


You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence


Did they take in the entire history as context?


Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.


This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.


It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.


I think you're mistaken. The abstract says it's interactive, "We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction"

Further - "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." specifically "and actions"

User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.


No, I am not. The interaction is part of the training, and is used during inference, but it is not included during the process of generation.


Okay, I think you're right. My mistake. I read through the paper more closely and I found the abstract to be a bit misleading compared to the contents. Sorry.


Don't worry. The paper is not very well written.


Academic authors are consistently better at editing away unclear and ambiguous statements which make their work seem less impressive compared to ones which make their work seem more impressive. Maybe it's just a coincidence, lol.


It's interactive, but can it go beyond what it learned from the videos? As in, can the camera break free and roam around the map from different angles? I don't think it will be able to do that at all. There are still a few hallucinations in this rendering; it doesn't look like it understands 3D.


You might be surprised. Generating views from new angles based on a single image is nothing novel, and if anything, this model has more than a single frame as input. I’d wager that it’s quite able to extrapolate DOOM-like corridors and rooms even if it hasn’t seen the exact place during training. And sure, it’s imperfect, but on the other hand it works in real time on a single TPU.


Then why do monsters become blurry, smudgy messes when shot? That looks like a video compression artifact of a neural network attempting to replicate a low-structure image (the source material contains guts exploding, a very unstructured visual).


Uh, maybe because monster death animations make up a small part of the training material (i.e. gameplay), so the model has not learned to reproduce them very well?

There cannot be "video compression artifacts" because it hasn’t even seen any compressed video during training, as far as I can see.

Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.


In a sense, poorly reproducing rare content is a form of compression artifact. Ie, since this content occurs rarely in the training set, it will have less impact on the gradients and thus less impact on the final form of the model. Roughly speaking, the model is allocating fewer bits to this content, by storing less information about this content in its parameters, compared to content which it sees more often during training. I think this isn't too different from certain aspects of images, videos, music, etc., being distorted in different ways based on how a particular codec allocates its available bits.


I simply cannot take seriously anyone who exclaims that monster death animations are a minor part of Doom. It's literally a game about slaying demons. Gameplay consists almost entirely of explosions and gore; killing monsters IS THE GAME. If you can't even get that correct, then what nonsense are we even looking at?


Maybe it's so advanced, it knows the players' next moves, so it is a video!


I guess you are being sarcastic, except this is precisely what it is doing. And it's not hard: player movement is low information and probably not the hardest part of the model.


?

I highly suggest you read the paper briefly before commenting on the topic. The whole point is that it's not just generating a video.


I did. It is generating a video, using latent information on player actions during the process (which it also predicts). It is not interactive.


Uff, I guess you’re right. Mea culpa. I misread their diagram to represent inference when it was about training instead. The latter is conditioned on actions, but… how do they generate the actual output frames then? What’s the input? Is it just image-to-image based on the previous frame? The paper doesn’t seem to explain the inference part at all well :(


It should be possible to generate an initial image from Gaussian noise, including the latent information on player position


Video is also higher resolution, as the pixels flip for the high-resolution world as you move through it. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixel change.


Correct, for the sprites. However, the walls in Doom are texture mapped, and so have the same issue as videos. Interesting, though, because I assume the antialiasing is something approximate, given the extreme demands on CPUs of the era.


After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.

It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.

There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.


We can't assess the quality of gameplay ourselves of course (since the model wasn't released), but one author said "It's playable, the videos on our project page are actual game play." (https://x.com/shlomifruchter/status/1828850796840268009) and the video on top of https://gamengen.github.io/ starts out with "these are real-time recordings of people playing the game". Based on those claims, it seems likely that they did get a playable system in front of humans by the end of the project (though perhaps not by the time the draft was uploaded to arXiv).


I also thought this, but refer back to the paper, not the abstract:

> A is the set of key presses and mouse movements…

> …to condition on actions, we simply learn an embedding A_emb for each action

So, it’s clear that in this model the diffusion process is conditioned by embedding A that is derived from user actions rather than words.

Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.

So we have a diffusion model which is trained solely on images of Doom, and which is conditioned on current Doom frames and user actions to produce subsequent frames.

So yes, the users are playing it.

However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.
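For the curious, that conditioning could look roughly like this (a sketch under assumed shapes, not the released code):

    import torch
    import torch.nn as nn

    class ActionConditioning(nn.Module):
        # Actions replace the text prompt via a learned embedding table (A_emb),
        # while past-frame latents are concatenated with the noise latents along
        # the channel dimension.
        def __init__(self, num_actions, embed_dim):
            super().__init__()
            self.action_emb = nn.Embedding(num_actions, embed_dim)

        def forward(self, noise_latents, past_frame_latents, actions):
            # actions: (B, T) integer ids of recent key presses / mouse moves
            cond = self.action_emb(actions)                            # (B, T, embed_dim)
            x = torch.cat([noise_latents, past_frame_latents], dim=1)  # channel concat
            return x, cond  # x feeds the UNet input; cond replaces the text embedding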


The agent never interacts with the simulator during training or evaluation. There is no user; there is only an agent which was trained to play the real game and which produced the sequences of game frames and actions that were used to train the simulator and to provide ground-truth sequences of game experience for evaluation. Their evaluation metrics are all based on running short simulations in the diffusion model which are initiated with some number of conditioning frames taken from the real game engine. Statements in the paper like: "GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware." are wildly misleading.


I wonder if they could somehow feed in a trained Gaussian splats model to this to get better images?

Since the splats are specifically designed for rendering, it seems like it would be an efficient way for the image model to learn the geometry without having to encode it in the image model itself.


I’m not sure how that would help vs just training the model with the conditionings described in the paper.

I’m not very familiar with Gaussian splats models, but aren’t they just a way of constructing images using multiple superimposed parameterized Gaussian distributions, sort of like the Fourier series does with waveforms using sine and cosine waves?

I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.


I'm not certain where it would fit in, but my thinking is this.

There's been a bunch of work on making splats efficient and good at representing geometry. Reading more, perhaps NeRFs would be a better fit, since they're an actual neural network.

My thinking is that if you trained a NeRF ahead of time to represent the geometry and layout of the levels, and plugged that into the diffusion model (as a part of computing the latents, and then also on the other side so it can be used to improve the rendering), then the diffusion model could focus on learning how actions manipulate the world without having to learn the geometry representation.


I don’t know if that would really help, I have a hard time imagining exactly what that model would be doing in practise.

To be honest none of the stuff in the paper is very practical, you almost certainly do not want a diffusion model trying to be an entire game under any circumstances.

What you might want to do is use a diffusion model to transform a low poly, low fidelity game world into something photorealistic. So the geometry, player movement and physics etc would all make sense, and then the model paints over it something that looks like reality based on some primitive texture cues in the low fidelity render.

I’d bet money that something like that will happen and it is the future of games and video.


Yeah, I realize this will never be useful for much in practice (although maybe as some kind of client side prediction for cloud gaming? But likely if you could run this in real time you might as well run whatever game there is in real time as well, unless there's some kind of massive world running on the server that's too large to stream the geometry for effectively), I was mostly just trying to think of a way to avoid the issues with fake looking frames or forgetting what the level looks like when you turn around that someone mentioned.

Not exactly that, but Nvidia does something like this already; they call it DLSS. It uses previous frames and motion vectors to render the next frame using machine learning.


The paper should definitely be more clear on this point, but there's a sentence in section 5.2.3 that makes me think that this was playable and played: "When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better." It may be a failure of imagination, but I can't think of another reasonable way of interpreting "playing with the model manually".


What you're describing reminded me of this cool project:

https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"


You are incorrect, this is an interactive simulation that is playable by humans.

> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.

The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.

Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.


If the generative model/simulator can run at 20FPS, then obviously in principle a human could play the game in simulation at 20 FPS. However, they do no evaluation of human play in the paper. My guess is that they limited human evals to watching short clips of play in the real engine vs the simulator (which conditions on some number of initial frames from the engine when starting each clip...) since the actual "playability" is not great.


Yeah. If it isn't doing this, then what could it be doing that is worth a paper? "real-time user input and adjusts its output accordingly"


There is a hint in the paper itself:

It says in a shy way that it is based on Ha & Schmidhuber (2018), "who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector".

So it means they most likely took https://worldmodels.github.io/ (which is actually open-source) or something similar and swapped the frame generation for Stable Diffusion, which was released in 2022.


>I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly

Well, you're wrong, as specified in the first video and by the authors themselves. Maybe next time check better instead of writing comments in such an authoritative tone about things you don't actually know.


I think someone is playing it, but it has a reduced set of inputs and they're playing it in a very specific way (slowly, avoiding looking back to places they've been) so as not to show off the flaws in the system.

The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.

I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".

As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.


The movement of the player seems a bit jittery, so I inferred something similar on that basis.


Were the agents playing at 20 real FPS, or did this occur like a Pixar movie offline?


Ehhh okay, I'm not as convinced as I was earlier. Sorry for misleading. There's been a lot of back-and-forth.

I would've really liked to see a section of the paper explicitly call out that they used humans in real time. There's a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay where those agents submitted user inputs to affect the gameplay and they captured those inputs in their model. This made it a bit murky as to whether humans ever actually got involved.

This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"

led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.

I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:

"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."

and it's like.. okay.. if you have a section in results on human evaluation, and your goal is to have humans play, then why are you talking just about humans reviewing video rather than giving some sort of feedback on the human gameplay experience - even if it's not especially positive?

Still, in the Discussion section, it mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it more clear that humans gave input which went outside the bounds of the automatic agents. It doesn't seem like this would occur if it were agents simulating more input.

Ultimately, I think that the paper itself could've been more clear in this regard, but clearly the publishing website tries to be very explicit by saying upfront - "Real-time recordings of people playing the game DOOM" and it's pretty hard to argue against that.

Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.


It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...


I knew it was too good to be true, but it seems like real-time video generation can get good enough that it feels like a truly interactive video/game

Imagine if text2game were possible. There would be some sort of network generating each frame from an image generated by text, with some underlying 3d physics simulation to keep all the multiplayer screens sync'd

This paper does not seem to demonstrate that possibility; rather, it uses some cleverly chosen words to make you think people were playing a real-time video. We can't even generate more than 5~10 seconds of video without it hallucinating. Something this persistent would require an extreme amount of gameplay video for training. It can be done, but the video shown by this paper is not true to its words.


The quest to run Doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement? I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.


>Technically speaking, isn't this the greatest possible anti-Doom

When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.


It's dreaming Doom.


We made machines dream of Doom. Insane.


Time to make a sheep mod for Doom.


Do Robots Dream of E1M1?


Droom


Pierre Menard, Author of Doom.


I applaud your erudition.


OK, this is the single most perfect comment someone could make on this thread. Diffusion me impressed.


Knee Deep in the Death of the Author.


that took a moment, thank you


> the Doom with the highest possible hardware requirement?

Isn't that possible by setting arbitrarily high goals for ray-cast rendering?


It's the No-Doom.


Undoom?


It’s a mood.


Bliss


> Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?

Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.

The next step of anti-Doom would be a model generating the model, generating the Doom output.


Isn't this technically a model (training step) generating a model (a neural network) generating Doom output?


“…now it can implement Doom!”


to me the closer analogy here is the "running minecraft inside minecraft" (https://news.ycombinator.com/item?id=32901461)


Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space 
Stable Diffusion v1:

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 GB
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
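
For a rough sense of the "hundreds of times over" claim, here is back-of-the-envelope arithmetic using the figures listed above (a sketch; sizes are approximate, and parameter count rather than checkpoint bytes is what really matters for capacity):

  # Rough capacity comparison, using the numbers quoted above
  doom_disk_mb = 12                    # Doom's install footprint
  sd14_checkpoint_mb = 4.27 * 1024     # SD 1.4 checkpoint (non-EMA)
  print(sd14_checkpoint_mb / doom_disk_mb)   # ~364x the size of the whole game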

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index


Those are valid points, but irrelevant for the context of this research.

Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all of that can either be done at the margin of this discovery, OR will naturally improve over time, OR will become less important as a blocker.

The fact that you can model a sequence of frames with such contextual awareness, without explicitly having to encode it, is the real breakthrough here, both from a pure gaming standpoint and for simulation in general.


I suppose it also doesn't really matter what kinds of resources the game originally requires. The diffusion model isn't going to require twice as much memory just because the game does. Presumably you wouldn't even necessarily need to be able to render the original game in real time - I would imagine the basic technique would work even if you used a state-of-the-art, Hollywood-quality offline renderer to render each input frame, and that the performance of the diffusion model would be similar?


Well, the majority of ML systems are compression machines (entropy minimizers), so ideally you'd want to see if you can learn the assets and game mechanics through play alone (what this paper shows). Better would be to do so more efficiently than the devs themselves, finding better compression. Certainly the game is not perfectly optimized. But still, this is a step in that direction. I mean, no one has accomplished this before, so even with a model with far higher capacity it's progress. (I think people are interpreting my comment as dismissive. I'm critiquing, but the key point I was making was about how there are likely better architectures, training methods, and all sorts of stuff still to research. Personally I'm glad there's still more to research. That's the fun part.)


>you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.

OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste


I'm not sure what you're saying is irrelevant.

1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".

2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)

3) An interesting point I was wondering about, to compare the current state of things (I mean, I'll grant you this one, but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯ \ _ ( ツ ) _ / ¯)

4) the point that you can rip a game

I'm really not sure what you're contesting to because I said several things.

  > it lacks basic things like pre-computing, storing, etc.
It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.


Sorry, probably didn't explain myself well enough

1) Yes, you are correct. The point I was making is that, in the context of the discovery/research, that's outside the scope and 'easier' to do, as it has been done in other verticals (i.e., e2e self-driving)

2) yep, aligned here

3) I'm not fully following here, but agreed, this is not NeurIPS, and no Schmidhuber-style bickering.

4) The network does store information, it just doesn't store gameplay information, which could be forced; but as per point 1, it is, and I think rightly so, beyond the scope of this research


1) I'm not sure this is outside scope. It's also not something I'd use to reject a paper were I to review this in a conference. I mean, you've got to start somewhere, and unlike reviewer 2 I don't think every criticism is a rejection criterion. That'd be silly, given the lack of globally optimal solutions. But I'm also unconvinced this is proven by self-driving vehicles, though I'm not an RL expert.

3) It's always hard to evaluate. I was thinking about ripping the game, and so a reasonable metric is a comparison with a human's ability to perform the task. Of course I'm A LOT faster than my dishwasher at cleaning dishes, but I'm not occupied while it is going, so it still has high utility. (Someone tell reviewer 2 lol)

4) Why should we believe that it doesn't store gameplay? The model was fed "user" inputs and frames. So it has this information and this information appears useful for learning the task.


Is it a breakthrough? Weather models are miles ahead of this as far as I can tell.


>What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute

That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.


> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

I guess that's the occasion to remind people that ML is splendid at interpolating; for extrapolating, maybe don't keep your hopes too high.

Namely, to have a "perfect flight sim" using GoPros, you'd need to record hundreds of stalls and crashes.


  > Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
You're jumping ahead there, and I'm not convinced you could ever do this (unless your model is already a great physics engine). The paper itself feeds the controls into the network. But a flight sim would be harder, because you'd also need to feed in air conditions. I just don't see how you could do this from video alone, let alone just video from the cockpit. Humans could not do this. There's just not enough information.


There's an enormous amount of information if your GoPro placement includes all the flight instruments. Humans can and do predict aircraft state t+1 by parsing a visual field that includes the instruments; that is what the instruments are for.


Plus, presumably, either training it on pilot inputs (and being able to map those to joystick inputs and mouse clicks) or having the user have an identical fake cockpit to play in and a camera to pick up their movements.

And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.


The possibilities seem to go far beyond gaming (given enough computational resources).

You could feed it videos of the usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by some sensors, though the action space would be much larger).

Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.


Wouldn't it make more sense to train using Microsoft Flight Simulator the same way they did DOOM? Though I'm not sure what the point would be if the game already exists.


It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.

Some of ya'll need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

Time spent enjoying yourself is never time wasted. Some of ya'll are going to be on your death beds wishing you had allowed yourself to have more fun.


The skepticism and criticism in this thread is aimed at the hype around AI; people saying "this is so amazing" imply that in some near future you can create any video game experience you can imagine by just replacing all the software with some AI models rendering the whole game.

When in reality this is the least efficient and reliable form of Doom yet created, using literally millions of times the computation used by the first x86 PCs that were able to render and play doom in real-time.

But it's a funny party trick, sure.


Yes it's less efficient and reliable than regular Doom, but it's not just limited to Doom. You could have it simulate a very graphically advanced game that barely runs on current hardware and it would run at the exact same speed as Doom.


Technically this is correct, but in order for this to work you have to

1: Build the entire game first

2: Record agents playing hundreds/thousands/millions of hours of it

3: Be able to run the simulation at far higher resolution than what's in the demo videos here for it to even matter that the hypothetical game is 'very graphically advanced'

This is the most impractical way yet invented to 'play' a game.


So true. The hustle culture is a spreading disease that has replaced the fun maker culture of the '80s/'90s.

It's unavoidable though. The cost of living becoming increasingly expensive, and the romanticization of entrepreneurs as if they were rock stars, lead towards this hustle mindset.


Today this exercise feels pointless. However, I remember the days when there were articles written about the possibility of "internet radio". Instead of good old broadcast waves through the ether and simply thousands of radios tuning in, some server would send a massive number of packets over massive lengths of copper to thousands of endpoints. Ad absurdum, the endpoints would even send ack packets upstream to the poor server to keep connections alive. It seemed like a huge waste of computing power, wire, and energy.

And here we are, binging netflix movies over such copper wires.

I'm not saying games will be replaced by diffusion models dreaming up next images based on user input, but a variation of that might end up in a form of interactive art creation or a new form of entertainment.


I don’t think this is not useful. This is a stepping stone for generating entire novel games.


> This is a stepping stone for generating entire novel games.

I don't see how.

This game "engine" is purely mapping [pixels, input] -> new pixels. It has no notion of game state (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again), not to mention that it requires the game to already exist in order to train it.

I suppose, in theory, you could train the network to include game state in the input and output, or potentially even handle game state outside the network entirely and just make it one of the inputs, but the output would be incredibly noisy and nigh unplayable.

And like I said, all of it requires the game to already exist in order to train the network.


> (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again)

Sounds like a great game.

> not to mention that it requires the game to already exist in order to train it

Diffusion models create new images that did not previously exist all of the time, so I'm not sure how that follows. It's not hard to extrapolate from TFA to a model that generically creates games based on some input


>It has no notion of game state (so you can kill an enemy, turn your back, then turn around again)

Well, you see a wall, you turn around, then turn back, and the wall is still there. With enough training data the model will be able to pick up the state of the enemy, because it has ALREADY learned the state of the wall thanks to much more numerous data on the wall. It's probably impractical to do this, but as said, this is only a stepping stone.

> not to mention that it requires the game to already exist in order to train it.

Is this a problem? Do games not exist? Not only do we have tons of games, but we also have, in theory, unlimited amounts of training data for each game.


> Well you see a wall you turn around then turn back the wall is still there. With enough training data the model will be able to pick up the state of the enemy because it has ALREADY learned the state of the wall due to much more numerous data on the wall.

It's really important to understand that ALL THE MODEL KNOWS is a mapping of [pixels, input] -> new pixels. It has zero knowledge of game state. The wall is still there after spinning 360 degrees simply because it knows that the image of a view facing away from the wall while holding the key to turn right eventually becomes an image of a view of the wall.

The only "state" that is known is the last few frames of the game screen. Because of this, it's simply not possible for the game model to know if an enemy should be shown as dead or alive once it has been off-screen for longer than those few frames. It also means that if you keeping turning away and towards an enemy, it could teleport around. Once it's off the screen for those few frames, the model will have forgotten about it.

> Is this a problem? Do games not exist?

If you're trying to make a new game, then you need new frames to train the model on.


>It's really important to understand that ALL THE MODEL KNOWS is a mapping of [pixels, input] -> new pixels. It has zero knowledge of game state.

This is false. What occurs inside the model is unknown. It arranges pixel input and produces pixel output as if it actually understands game state. As with LLMs, we don't fully understand what's going on internally. You can't assume that models don't "understand" things just because the high-level training methodology only includes pixel input and output.

>The only "state" that is known is the last few frames of the game screen. Because of this, it's simply not possible for the game model to know if an enemy should be shown as dead or alive once it has been off-screen for longer than those few frames. It also means that if you keeping turning away and towards an enemy, it could teleport around. Once it's off the screen for those few frames, the model will have forgotten about it.

This is true. But then one could say it knows game state for up to a few frames. That's different from saying the model ONLY knows pixel input and pixel output. Very different.

There are other tricks for long term memory storage as well. Think Radar. Radar will capture the state of the enemy beyond just visual frames so the model won't forget an enemy was behind them.

Game state can also be encoded into some frame pixels in the bottom lines. The model can pick up on these associations.
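
As a toy illustration of that pixel-encoding idea (purely speculative on my part, not something from the paper; the frame size and flag layout are made up), you could stamp a few boolean state bits into the bottom rows of each frame so a frame-conditioned model can "see" otherwise hidden state:

  import numpy as np

  H, W = 240, 320                      # assumed frame size

  def stamp_state(frame, flags):
      # Write boolean game-state flags (e.g. enemy-alive bits) into the
      # bottom four rows as black/white blocks.
      out = frame.copy()
      block = W // max(len(flags), 1)
      for i, flag in enumerate(flags):
          out[H - 4:, i * block:(i + 1) * block, :] = 255 if flag else 0
      return out

  frame = np.zeros((H, W, 3), dtype=np.uint8)
  stamped = stamp_state(frame, flags=[True, False, True])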

edit: someone mentioned that the game state lasts past a few frames.

>If you're trying to make a new game, then you need new frames to train the model on.

Right, so for a generative model, instead of training the model on one game you would train it on multitudes of games. The model would then, based on a seed number, output a new type of game.

Alternatively you could have a model generate a model.

All of what I'm saying is of course speculative. As I said, this model is a stepping stone for the future, just as the LLM, which is only trivially helpful now, can be a stepping stone for replacing programmers altogether.


Read the paper. It is capable of maintaining state for a fairly long time including updating the UI elements.


I'd like to know the carbon footprint of that fun.


Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component to propel your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car, nor a road to drive on, to do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.


This seems more a critique of the particular model (as in, the resulting diffusion model generated), not of diffusion models in general. It's also a bit misstated: this doesn't require a working car on the road to do its job (present tense); it required one to train it to do its job (past tense), and it's not particularly clear why a game engine built from concepts learned from how another one worked should cease to be a game engine. For diffusion models in general, and not just the specifically trained example here, I don't see why one would assume the approach can't also work outside of the particular "test tracks" it was trained on, just as a typical diffusion model works on more than generating the exact images it was trained on (it can interpolate and apply individual concepts to create a novel output).


My point is something else: a game engine is something which can be separated from a game and put to use somewhere else. This is basically the definition of „engine“. The above is not an engine but a game without any engine at all, and therefore it should not be called an „engine“.


Interesting point.

In a way this is a "simulated game engine", trained from actual game engine data. But I would argue a working simulated game engine becomes a game engine in its own right, as it is then able to "propel the game", as you say. The way it achieves this becomes irrelevant: in one case the content was crafted by humans, in the other case it mimics existing game content; the player really doesn't care!

> An engine would also work offroad.

Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places. I'd even say extrapolation capabilities of such a model could be better than a traditional game engine, as it can make things up as it goes, while if you accidentally cross a wall in a typical game engine the screen goes blank.


The game Doom is more than a game engine, isn't it? I'd be okay with calling the above a „simulated game“ or a „game“. My point is: let's not conflate these with the idea of a „game engine“, which is a construct of intellectual concepts put together to create a simulation of „things happening in time“ and to derive output (audio and visual). The engine is fed with input and data (levels and other assets) and then drives (EDIT) a „game“.

Training the model with a final game will never give you an engine; maybe a „simulated game“ or even a „game“, but certainly not an „engine“. The latter would mean the model would be capable of deriving and extracting the technical and intellectual concepts and applying them elsewhere.


> Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places.

They easily could have demonstrated this by seeding the model with images of Doom maps which weren't in the training set, but they chose not to. I'm sure they tried it and the results just weren't good, probably morphing the map into one of the ones it was trained on at the first opportunity.


There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.


With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

At which point, you effectively would be interpolating in latent space through the source code to actually "render" the game. You'd have an entire latent space computer, with an engine, assets, textures, a software renderer.

With a sufficiently powerful computer, one could imagine interpolating in this latent space between, say, Factorio and TF2 (2 of my favorites), and tweaking this latent space to your liking by conditioning it on any number of gameplay aspects.

This future comes very quickly for subsets of the pipeline, like the very end stage of rendering -- DLSS is already in production, for example. Maybe Nvidia's revenue wraps back to gaming once again, as we all become bolted into a neural metaverse.

God I love that they chose DOOM.


> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

Neural nets are not guaranteed to converge to anything even remotely optimal, so no that isn't how it works. Also even though neural nets can approximate any function they usually can't do it in a time or space efficient manner, resulting in much larger programs than the human written code.


Could is certainly a better word, yes. There is no guarantee that it will happen, only that it could. The existence of LLMs is proof of that; imagine how large and inefficient a handwritten computer program to generate the next token would be. On the flip side, the fact that human beings very effectively predict the next token, and much more, on 5 watts is proof that LLMs in their current form certainly are not the most efficient method for generating the next token.

I don't really know why everyone is piling on me here. Sorry for a bit of fun speculating! This model is on the continuum. There is a latent representation of Doom in weights: some weights, not these weights. Therefore some representation of Doom in a neural net could become more efficient over time. That's really the point I'm trying to make.


  > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. 
You and I have very different definitions of compression

https://news.ycombinator.com/item?id=41377398

  > Someone in the field could probably correct me on that.
^__^


The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity.

It's easy to see this by noting that you can often prune networks quite a bit without any loss in performance. I.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.
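
As a small, concrete illustration of that pruning point (a sketch using PyTorch's built-in magnitude pruning; the layer and the 50% figure are arbitrary, and whether any given model tolerates it is an empirical question):

  import torch
  import torch.nn.utils.prune as prune

  layer = torch.nn.Linear(512, 512)
  prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero out half the weights
  sparsity = (layer.weight == 0).float().mean().item()     # ~0.5
  # If accuracy survives this, the effective complexity was well below raw capacity.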


I think you're confusing capacity with the training dynamics.

Capacity is autological: the amount of information it can express.

Training dynamics are the way the model learns, the optimization process, etc. So this is where things like regularization come into play.

There's also the architecture, which affects the training dynamics as well as model capacity. None of this guarantees that you get the most information-dense representation.

Fwiw, the authors did also try distillation.


Sorry I wasn't more clear! I'm referring to the Kolmogorov complexity of the network. The OP said:

> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

And they're not wrong! An ideally trained network could, in principle, learn the data-generating program, if that program is within its class of representable functions. I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).

You're right that there's no guarantee that the model finds the most "dense" representation. The goal of regularization is to encourage that, though!

All over the place in ML there are bounds like:

test loss <= train loss + model complexity

Hence minimizing model complexity improves generalization performance. This is a kind of Occam's Razor: the simplest model generalizes best. So the OP is on the right track - we definitely want networks to learn the "underlying" process that explains the data, which in this case would be a latent representation of the source code (well, except that doesn't really make sense since you'd need the whole rest of the compute stack that code runs on - the neural net has no external resources/embodied complexity it calls, unlike the source code which gets to rely on drivers, hardware, operating systems, etc.)
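
For concreteness, one classic instance of that schematic inequality (finite hypothesis class, loss bounded in [0, 1], n i.i.d. samples, confidence 1 - \delta; richer settings swap the square-root term for VC or PAC-Bayes complexity):

  L_{test}(h) \le L_{train}(h) + \sqrt{ (\ln|H| + \ln(1/\delta)) / (2n) }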


  > An ideally trained network could, in principle, learn the data-generating program
No disagreement

  > I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
Also no disagreement.

I suggested that this probably isn't the case here, since they tried distillation and saw no effect. While this isn't proof that this particular model can't be compressed further, it does suggest that doing so is non-trivial. This is especially true given the huge difference in size. I mean, we're talking about 700x...

Where I think our disagreement lies is that I read the OP as talking about __this__ network. If we're talking about a theoretical network, well... nothing I said anywhere is in any disagreement with that. I even said in the post I linked to that the difference shows there's still a long way to go, but that this is still cool. Why did I assume the OP was talking about __this__ network? Well, because we're in a thread talking about a paper, and... yes, we're talking about compression machines, so theoretically (well, not actually supported by any math theory) this is true for so many things, and that is a bit elementary. So it makes more sense (imo) that we're talking about this network. And I wanted to make it clear that this network is nowhere near that kind of compression. Can further research later result in something that is better than the source code? Who knows? For all the reasons we've both mentioned. We know they are universal approximators (which are not universal mimickers and have limits), but we have no guarantee of global convergence (let alone proof such a thing exists for many problems).

And I'm not sure why you're trying to explain the basic concepts to me. I mentioned I was an ML researcher. I see you're a PhD at Oxford. I'm sure you would be annoyed if I was doing the same to you. We can talk at a different level.


Totally fair points all. Sorry if it came across as condescending!

I agree with you that this network probably has not found the source code or something like a minimal description in its weights.

Honestly, I'm writing a paper on model compression/complexity right now, so I may have co-opted the discussion to practice talking about these things...! Just a bit over-eager (,,>﹏<,,)

Have you given much thought to how we can encourage models to be more compressible? I'd love to be able to explicitly penalize the filesize during training, but in some usefully learnable way. Proxies like weight norm penalties have problems in the limit.


Haha totally fair and it happens to me too, but trying to work on it.

I actually have some stuff I'm working on in that area that is having some success. I do need to extend it to diffusion but I see nothing stopping me.

Personally I think a major slowdown for our community is its avoidance of math. Like, you don't need to have tons of math in the papers, but many of the lessons you learn in the higher-level topics do translate to usable techniques in ML. Though I would also like to see a stronger push on theory, because empirical results can be deceiving (Von Neumann's elephant and all).


The source code lacks information required to render the game. Textures for example.


Obviously assets would get encoded too, in some form. Not necessarily corresponding to the original bitmaps, if the game does some consistent post-processing, the encoded thing would more likely be (equivalent to) the post-processed state.


Finally, the AI superoptimizing compiler.


That's just an artifact of the language we use to describe an implementation detail; in the sense GP means it, the data payload bits are not essentially distinct from the executable instruction bits.


The Holographic Principle is the idea that our universe is a projection of a higher dimensional space, which sounds an awful lot like the total simulation of an interactive environment, encoded in the parameter space of a neural network.

The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?


Similarly, you could run a very very simple game engine, that outputs little more than a low resolution wireframe, and upscale it. Put all of the effort into game mechanics and none into visual quality.

I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.


> one can imagine a near future where text prompts are enough to create a fun new game

Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.

By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking


If it can be trained on (many) existing games, then it might work similarly to how you don't need to describe every possible detail of a generated image in order to get something that looks like what you're asking for (and looks like a plausible image for the underspecified parts).


Things that might look plausible in a static image will not look plausible when things are moving, especially in a game.

Also: https://news.ycombinator.com/item?id=41376722

Also: define "fun" and "new" in a "simple text prompt". Current image generators suck at properly reflecting what you want exactly, because they regurgitate existing things and styles.


> where text prompts are enough to create a fun new game!

Not really. This is a reproduction of the first level of Doom. Nothing original is being created.


Video games are gonna be wild in the near future. You could have one person talking to a model producing something that's on par with a AAA title from today. Imagine the 2d sidescroller boom on Steam but with immersive photorealistic 3d games with hyper-realistic physics (water flow, fire that spreads, tornados) and full deformability and buildability because the model is pretrained with real world videos. Your game is just a "style" that tweaks some priors on look, settings, and story.


Sorry, no offence, but you sound like those EA execs who wear expensive suits and have never played a single video game in their entire life. There's a great documentary on how Half Life was made. Gabe Newell was interviewed by someone asking “why you did that and this, it's not realistic”, where he answered “because it's more fun this way, you want realism — just go outside”.


Most games are conditioned on text, it's just that we call it "source code" :).

(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)


This got me thinking. Anyone tried using SD or similar to create graphics for the old classic text adventure games?


So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?


here are some ideas:

- you could build a non-real-time version of the game engine and use the neural net as a real-time approximation

- you could edit videos shot in real life to have huds or whatever and train the neural net to simulate reality rather than doom. (this paper used 900 million frames which i think is about a year of video if it's 30fps, but maybe algorithmic improvements can cut the training requirements down) and a year of video isn't actually all that much—like, maybe you could recruit 500 people to play paintball while wearing gopro cameras with accelerometers and gyros on their heads and paintball guns, so that you could get a year of video in a weekend?
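
quick arithmetic check on that second idea (the 30 fps figure is from above; the hours-per-day assumption is mine):

  frames = 900_000_000
  hours_of_video = frames / 30 / 3600       # ~8,333 hours, i.e. roughly a year
  weekend_camera_hours = 500 * 2 * 10       # 500 players x 2 days x ~10 h/day = 10,000
  # so a weekend of paintball roughly covers it, give or take the assumptions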


Why games? I will train it on 1 years worth of me attending Microsoft teams meetings. Then I will go surfing.


Even if you spend 40 hours a week in video conferences, you'll have to work for over four years to get one years' worth of footage. Of course, by then the models will be even better and so you might actually have a chance of going surfing.

I guess I should start hoarding video of myself now.


the neural net doesn't need a year of video to train to simulate your face; it can do that from a single photo. the year of video is to learn how to play the game, and in most cases lots of people are playing the same game, so you can dump all their video in the same training set


Ready to pay for this


most underrated comment here!


That feels like the endgame of video game generation. You select an art style, a video and the type of game you'd like to play. The game is then generated in real-time responding to each action with respect to the existing rule engine.

I imagine a game like that could get so convincing in its details and immersiveness that one could forget they're playing a game.


There are thousands of games that mimic each other, and only a handful of them are any good.

What makes you think a mechanical "predict next frame based on existing games" will be any good?


Oh, because we can link this in with biometric responses - heartrate, temperature, eye tracking etc.

We could build a 'game' which would learn and adapt to precisely the chemistry that makes someone tick, and then provide them a map for reaching the state in which their brain releases their desired chemicals.

Then if the game has a directive - it should be pointed to work as a training tool to allow the user to determine how to release these chemicals themselves at will. Resulting in a player-base which no longer requires anything external for accessing their own desired states.


IIRC, both 2001 (1968) and Solaris (1972) depict that kind of thing as part of an alien euthanasia process, not as a happy ending


Well, 2001 is actually a happy ending, as Dave is reborn as a cosmic being. Solaris, at least in the book, is an attempt by the sentient ocean to communicate with researchers through mimics.


Also The Matrix, Oblivion, etc.


Have you ever played a video game? This is unbelievably depressing. This is a future where games like Slay the Spire, with a unique art style and innovative gameplay simply are not being made.

Not to mention this childish nonsense about "forget they're playing a game," as if every game needs to be lifelike VR and there's no room for stylization or imagination. I am worried for the future that people think they want these things.


The problem is quite the opposite: AI will be able to generate so many games, with so many play styles, that it will totally dilute the value of all games.

Compare it to music-gen algos that can now produce music that is 100% indiscernible from generic crappy music. Which is insane given that 5 years ago they could maybe create the sound of something that someone might describe as "sort of guitar-like". At this rate of progress it's probably not going to be long before AI is making better music than humans. And it's infinitely available too.


It's a good thing. When the printing press was invented there were probably monks and scribes who thought that this new mechanical monster that took all the individual flourish out of reading was the end of literature. Instead it became a tool to make literature better and just removed a lot of drudgery. Games with individual style and design made by people will of course still exist. They'll just be easier to make.


EXISTENZ IS PAUSED!


Holodeck is just around the corner


Except for haptics.


The Cloud Gaming platforms could record things for training data.


If you train it on multiple games then you could produce new games that have never existed before, in the same way image generation models can produce new images that have never existed before.


It's unlikely that such a procedurally generated mashup would be perfectly coherent, stable and most importantly fun right out of the gate, so you would need some way to reach into the guts of the generated game and refine it. If properties as simple as "how much health this enemy type has" are scattered across an enormous inscrutable neural network, and may not even have a single consistent definition in all contexts, that's going to be quite a challenge. Nevermind if the game just catastrophically implodes and you have to "debug" the model.


From what I understand that could make the engine much less stable. The key here is repetitiveness.


maybe the next step is adding text guidance and generating non-existing games.


I think the same comment could be said about generative images, no?


Maybe, in the future, techniques from Scientific Machine Learning, which can encode physics and other known laws into a model, would form a base model. And then other models on top could fine-tune aspects to customise a game.


If only there was a rich 3-dimensional physical environment we could draw training data from.


Well, yeah. Image diffusion models only work because you can provide large amounts of training data. For Doom it is even simpler, since you don't need to deal with compositing.


A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).


> even rules which are not visible on-screen.

If a rule was changed but it's never visible on the screen, did it really change?

> It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.


> If a rule was changed but it's never visible on the screen, did it really change?

Well for "some" games it does really change


> Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

It's much simpler than actually creating a game....


If someone told you 10 years ago that they were going to create something where you could play a whole new level of Doom, without them writing a single line of game logic/rendering code, would you say that that is simpler than creating a demo by writing the game themselves?


There are two things at play here: the complexity of the underlying mechanism, and the complexity of detailed creation. This is obviously a complicated mechanism, but in another sense it's a trivial result compared to actually reproducing the game itself in its original intended state.


They only trained it on one game and only embedded the control inputs. You could train it on many games and embed a lot more information about each of them which could possibly allow you to specify a prompt that would describe the game and then play it.


One thing I'd like to see is to take a game rendered with low poly assets (or segmented in some way) and use a diffusion model to add realistic or stylized art details. This would fix the consistency problem while still providing tangible benefits.


The title should be "Diffusion Models can be used to render frames given user input"


So all it did is generate a video of the gameplay which is slightly different from the video it used for training?


No, it implements a 3D FPS that's interactive, and renders each frame based on your input and a lot of memorized gameplay.


But is it playing the actual game or just making a interactive video of it?


Yes.

All video games are, by definition, interactive videos.

What I imagine you're asking about is, a typical game like Doom is effectively a function:

  f(internal state, player input) -> (new frame, new internal state)
where internal state is the shape and looks of loaded map, positions and behaviors and stats of enemies, player, items, etc.

A typical AI that plays Doom, which is not what's happening here, is (at runtime):

  f(last frame) -> new player input
and is attached in a loop to the previous case in the obvious way.

What we have here, however, is a game you can play but implemented in a diffusion model, and it works like this:

  f(player input, N last frames) -> new frame
Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it, in a way, internalized most of the state and its evolution, so it can look at what's going on and guess what's about to happen. Which is what it does: it renders the next frame by predicting it, based on current user input and last N frames. And then that frame becomes the input for the next prediction, and so on, and so on.
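
A minimal sketch of that loop (my own illustration, not the authors' code; the context length N, the frame size, and the predict_next_frame stand-in for the trained diffusion model are all assumptions):

  from collections import deque
  import numpy as np

  N, H, W = 64, 240, 320                    # context length and frame size (assumed)

  def predict_next_frame(past_frames, past_actions):
      # Stand-in for the action-conditioned diffusion model.
      return np.zeros((H, W, 3), dtype=np.uint8)

  frames = deque([np.zeros((H, W, 3), dtype=np.uint8)] * N, maxlen=N)
  actions = deque([0] * N, maxlen=N)

  def step(player_input):
      actions.append(player_input)
      frame = predict_next_frame(list(frames), list(actions))
      frames.append(frame)    # the only "state"; anything older than N frames is gone
      return frame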

So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.


Thank you for the further explanation, that’s what I thought in the meantime and intended to find out with my question.

That opens up a new branch of possibilities.


Making an interactive video of it. It is not playing the game, a human does that.

With that said, I wholly disagree that this is not an engine. This is absolutely a game engine and while this particular demo uses the engine to recreate DOOM, an existing game, you could certainly use this engine to produce new games in addition to extrapolating existing games in novel ways.


What is the difference?


The job of the game engine is also to render the world given only the worlds properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.


So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

Given a sufficient separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent to what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.
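
As a toy sketch of that mapping (purely illustrative; the snapshot contents are stand-ins for whatever the engine would actually save):

  import hashlib, json

  snapshots = {}                                # hash of input sequence -> saved state

  def key_for(inputs):
      return hashlib.sha256(json.dumps(inputs).encode()).hexdigest()

  def record(inputs, state):
      snapshots[key_for(inputs)] = state        # e.g. positions, health, doors opened

  def lookup(inputs):
      return snapshots.get(key_for(inputs))     # graphics get mapped on in a later step

  record(["forward", "forward", "turn_left"], {"x": 3, "y": 1, "health": 100})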

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.


I think you've just encoded the title of the paper


> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim; anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.


This. Watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, damage is inconsistent.

The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players who go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master sometimes needs to protect the players from themselves; whereas the current generation of AI is famously incapable of saying no.)


I also noticed that they played AI DOOM very slowly: in an actual game you are running around like a madman, but in the video clips the player is moving in a very careful, halting manner. In particular the player only moves in straight lines or turns while stationary, they almost never turn while running. Also didn't see much strafing.

I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.


It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.


It's telling, IMO, that they only want people's opinions based on our notoriously faulty memories, rather than sitting comparable situations next to one another in the game and the simulation and then analyzing them. Several things jump out watching the example video.


>rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was setup if you read the paper.


I think you misunderstand me. I don't mean a snap evaluation and deciding between two very-short competing videos which is what the participants were doing. I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

What I'd posit is that it's not actually a very good replication of the game, but very good at replicating short clips that almost look like the game, and that the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.


>I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

Do you mean the PSNR and LPIPS metrics used in paper?


No, I think I've been pretty clear that I'm interested in how mechanically sound the simulation is. Also those measures are over an even shorter duration so even less relevant to how coherent it is at real game scales.


How should this be concretely evaluated and measured? A vibe check?


I think the study's evaluation using very short videos and humans is much more of a vibe check than what I've suggested.

Off the top of my head: DOOM is open source, so it should be reasonable to set up repeatable scenarios and use some frames from the game to create an identical starting scenario for the simulation. Then the input from the player of the game could be used to drive the simulated version. You could go further and instrument events occurring in the game for direct comparison with the simulation. I'd be interested in setting a baseline for playtime of the level in question and using sessions of around that length as an ultimate test.

There are some obvious mechanical deficiencies visible in the videos they've published. One that really stood out to me was the damage taken when standing in the radioactive slime. So I don't think the analysis would need to be particularly deep to find differences.
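
Something along those lines could be scripted against ViZDoom: replay a recorded action sequence in the real game, log per-tic variables like health, and diff them against the simulator fed the same inputs. A sketch (assuming the standard ViZDoom Python API; the config path and action encoding are placeholders):

  from vizdoom import DoomGame, GameVariable

  game = DoomGame()
  game.load_config("scenarios/e1m1.cfg")    # placeholder config for the level under test
  game.init()
  game.new_episode()

  recorded_actions = [[0, 0, 1]] * 100      # placeholder for the human player's inputs
  health_trace = []
  for action in recorded_actions:
      game.make_action(action)
      health_trace.append(game.get_game_variable(GameVariable.HEALTH))
  game.close()
  # Feed the same action sequence to the simulator and compare its HUD health
  # readout against health_trace, e.g. while standing in the radioactive slime.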


What I don't understand is the following: if this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work, the most obvious thing to do was to generate the next frame based on previous frames, but... it took 1-2 years for good video models to appear. For example, compare Sora generating Minecraft video versus this method generating Minecraft video. Say in both cases the player is standing on a meadow with few inputs and watching some pigs. In the Sora video you'd expect the typical glitches to appear, like erratic, sliding movement, overlapping legs, multiplication of pigs, etc. Would these glitches not appear in the GameNGen video? Why?


Because video is much more difficult than images (it's lots of images that have to be consistent across time, with motion following laws of physics etc), and this is much more limited in terms of scope than pure arbitrary video generation.


This misses the point; I'm comparing two methods of generating Minecraft videos.


By simplifying the problem, we are better able to focus on researching specific aspects of generation. In this case, they synthetically created a large, highly domain-specific training set and then used this to train a diffusion model which encodes input parameters instead of text.

Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.

In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.
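
A rough sketch of that noise-augmentation trick (an illustration of the general idea, not the paper's exact recipe; the noise range is made up): corrupt the conditioning frames with a random amount of Gaussian noise during training, and tell the model how much was added, so at inference it tolerates its own slightly-off previous outputs.

  import numpy as np

  rng = np.random.default_rng(0)

  def augment_context(context_frames, max_sigma=0.7):
      # Add Gaussian noise to the past frames used as conditioning.
      sigma = rng.uniform(0.0, max_sigma)            # sampled per training example
      noisy = [f + rng.normal(0.0, sigma, f.shape) for f in context_frames]
      return noisy, sigma                            # sigma is also fed to the model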


I would have thought it is much easier to generate huge amounts of game footage for training, but as I understand this is not what was done here.


An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?


There has definitely been research for simulating physics based on observation, especially in fluid dynamics but also for rigid body motion and collision. It's important for robotics applications actually. You can bet people will be applying this technique in those contexts.

I think for real world application one challenge is going to be the "action" signal which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the buttons, but for real world scenarios you need difficult and intrusive sensor setups for recording force signals.

(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)


A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.

But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.
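To put rough numbers on the difference in outcome spaces, here's a back-of-the-envelope comparison; the vocabulary size and frame resolution are assumptions for illustration, not figures from the paper:

    import math

    vocab_size = 50_000                      # typical LLM vocabulary (assumed)
    bits_per_token = math.log2(vocab_size)   # ~15.6 bits per prediction step

    width, height, channels, bit_depth = 320, 240, 3, 8
    bits_per_frame = width * height * channels * bit_depth  # 1,843,200 bits per step

    print(f"next-token outcome space: ~2^{bits_per_token:.1f}")
    print(f"next-frame outcome space: ~2^{bits_per_frame}")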


However, this is due to how we actually digitize video. From a human point of view, looking around my room reduces the load to the _objects_ in the room, and everything else is just noise (e.g. the color of the wall could be a single item to remember, whereas in the digital world all of its pixels need to be remembered).


This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close to (in some ways even way beyond), and yet at the same time so extremely far away from, 'our' intelligence.

(I say it can't count because there are numerous examples where the bullet count glitches; it goes right impressively often, but still, counting up or down is something computers have been able to do flawlessly basically forever.)

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)


'Our' intelligence may not be the best thing we can make. It would be like only making planes that flap wings, or trucks with legs; a bit like using an LLM to do multiplication. Not the best tool. Biomimicry is great for inspiration, but it shouldn't be a 1-to-1 copy, especially at a different scale and in a different medium.


Sure, although I still think a system with less of a contrast between how well it performs 'modally' and how badly it performs occasionally would be more practical.

What I wonder is whether LLMs will inherently always have this dichotomy and we need something 'extra' (reasoning, attention, or something less biomimicked), or whether it will eventually resolve itself (to an acceptable extent) as they improve further.


How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.


> I love how the player's health goes down when he stands in the radioactive green water.

This is one of the bits that was weird to me; it doesn't work correctly. In the real game you take damage at a consistent rate; in the video the player doesn't, and whether the player takes damage seems highly dependent on some factor other than whether the player is in the radioactive slime. My thought is that it's learnt something else that correlates poorly.


It gets a number of previous frames as input I think.


> In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, if at all).


There's been a ton of work to generate assets for games using AI: 3d models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc. I bet some interesting things would happen


But can you take what this AI has learned and generate the 3D models, maps and code to turn it into an actual game that can run on a user's PC? That would be amazing.


Jensen Huang's vision that future games will be generated rather than rendered is coming true.


What would be the point? This model has been trained on an existing game, so turning it back into assets, maps, and code would just give you a copy of the original game you started with. I suppose you could create variations of it then... but:

You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.


I would absolutely love if they could take this demo, add a new door that isn’t in the original, and see what it generates behind that door


Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?


I think some state is also being given (or if it's not, it could be given) to the network, like the 3D world position/orientation of the player, which could help the network anchor the player in the world.


Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations, e.g. when it picked up the green jacket from a corner and, walking back, it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.


> Is it possible to break the camera free and roam around the map freely and view it from different angles?

I would assume only if the training data contained this type of imagery, which it did not. The training data (from what I understand) consisted only of input+video of actual gameplay, so that is what the model is trained to mimic.

This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.


> Therefore I don't think it has any clue about the 3D world of the game at all.

AI models don't "know" things at all.

At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.

It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.


There is going to be a flood of these dreamlike "games" in the next few years. This feels like a bit of a breakthrough in the engineering of these systems.


What is useful about this? I am a game programmer, and I cannot imagine a world where this improves any part of the development process. It seems to me to be a way to copy a game without literally copying the assets and code; plagiarism with extra steps. What am I missing?


How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?


Watch closely in the videos and you'll see that enemies often respawn when offscreen, and sometimes when onscreen. Destroyed barrels come back, ammo count and health fluctuate weirdly, etc. It's still impressive, but it's not perfect in that regard.


Not unlike in (human) dreams.


It doesn't even remember the state of the game you're looking at: doors spawning right in front of you, particle effects turning into enemies mid-flight, etc. So, just regular gen-AI issues.

Edit: You can see this in the first 10 seconds of the first video under "Full Gameplay Videos": stairs turning into a corridor turning into a closed door for no reason, without looking away.


There's also the case in the video (0:59) where the player jumps into the poison but doesn't take damage for a few seconds then takes two doses back-to-back - they should've taken a hit of damage every ~500-1000ms(?)

Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.


It doesn't. You need to put the world state in the input (the "prompt", even it doesn't look like prompt in this case). Whatever not in the prompt is lost.


this is truly a cool demo, but a very misleading title.

To me it seems like a very brute-force, greedy way to give the user the impression that they are "playing" a game; the catch being that you already need to own the game to make this possible, yet you can't let the user play that copy!

Using generative AI for game creation is at a nascent stage, but there are much more elegant ways to go about the end goal. Perhaps in the future, with computing so far ahead that we've moved beyond the current architecture, this might be worth doing instead of emulation.


So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?


It's conditioned on previous frames AND player actions so it's interactive.


> To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
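My rough reading of the quoted trick, as a sketch (made-up tensor shapes, not the authors' code): the context latents the model conditions on are corrupted during training, so at inference time the model treats its own slightly-off previous outputs the same way and pulls them back toward the training distribution instead of letting small errors compound frame after frame (that compounding is the "drift"). Is that right?

    import torch

    def corrupt_context(context_latents: torch.Tensor, max_sigma: float = 0.7):
        """Add random-strength Gaussian noise to encoded context frames.

        context_latents: (batch, n_frames, channels, h, w) latents from the encoder.
        Returns the noised latents plus the sigma used, which could also be fed
        to the model as an extra conditioning signal.
        """
        batch = context_latents.shape[0]
        sigma = torch.rand(batch, 1, 1, 1, 1, device=context_latents.device) * max_sigma
        noise = torch.randn_like(context_latents)
        return context_latents + sigma * noise, sigma.flatten()

    # During training: noisy_ctx, sigma = corrupt_context(encode(prev_frames))
    # and the denoiser is conditioned on (noisy_ctx, player_actions).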


I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.


Although ideally a follow up work would be something where there won’t be any potential legal trouble with releasing the complete model so people can play it.

A similar approach might work with a game where the exact input is obvious and unambiguous from the graphics alone, so that you can use unannotated data. You'd just have to create a model that produces the action annotations. I'm not sure what the point would be, but it sounds like it'd be interesting.
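A rough sketch of what that action-annotation model could look like, assuming the action really is recoverable from two consecutive frames (architecture, names, and shapes are made up):

    import torch
    import torch.nn as nn

    class ActionLabeler(nn.Module):
        """Predicts the player action that produced the transition frame_t -> frame_t+1."""

        def __init__(self, n_actions: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(64, n_actions)

        def forward(self, frame_t, frame_t1):
            # Stack the two RGB frames channel-wise: (batch, 6, H, W).
            x = torch.cat([frame_t, frame_t1], dim=1)
            return self.head(self.features(x))

    # Trained on a small annotated set, then used to label unannotated gameplay
    # video so it can serve as conditioning data for the world model.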


Several thoughts for future work:

1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this perhaps information about what is present in the current level so that you could prompt it to produce a new level from some combination.

2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.


That's probably how our reality is rendered.


If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.


They got DOOM running on a diffusion engine before GTA 6


Maybe one day this will be how operating systems work.


Don't give them ideas lol terrifying stuff if that happens!


NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0


Ah, finally we are starting to see something gaming-related. I'm curious why we haven't seen more neural networks applied to games, even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005), and I'm surprised we don't have something similar years after the advent of LLMs.

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?


> We could have mods for old games that generate voices for the characters for example

You mean in real time? Or just in general?

There are a lot of mods that use AI-generated voices. I'd say it's the norm in the modding community now.


Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.


You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.



Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.
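A crude version of that control-signal extraction, as a sketch using OpenCV's dense optical flow (the mean flow vector is a stand-in for true camera motion; a real pipeline would want something more robust, like homography fitting):

    import cv2
    import numpy as np

    def camera_motion_signal(video_path: str):
        """Yield a per-frame (dx, dy) mean optical-flow vector as a crude control input."""
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            yield float(np.mean(flow[..., 0])), float(np.mean(flow[..., 1]))
            prev_gray = gray
        cap.release()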


What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.


>Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control

https://deepmind.google/discover/blog/rt-2-new-model-transla...


YouTube user hu-po streams critical, in-depth reviews of AI papers. Here is his take on this paper (and other relevant ones): https://www.youtube.com/live/JZgqQB4Aekc

Hehe, this sounds like the backstory of a remake of the Terminator, or "I Have No Mouth, and I Must Scream." In the aftermath of AI killing off humanity, researchers look deeply into how this could have happened. And after a number of dead ends, they finally realize: it was trained, in its infancy, on Doom!


Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715k, and with all assets somewhere around 10MB. It looks like the SD 1.4 files are over 2GB, so it's likely we're looking at a 200-2000x increase in file size, depending on whether you think of this as an 'engine' or the full game.
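For what it's worth, the rough arithmetic behind my 200-2000x range, taking my own numbers at face value:

    engine_only_kb = 715           # Doom.exe (reported above)
    full_game_mb = 10              # with assets (reported above)
    sd14_gb = 2                    # SD 1.4 weights (reported above)

    sd14_kb = sd14_gb * 1024 * 1024
    print(sd14_kb / (full_game_mb * 1024))   # ~205x vs. the full game
    print(sd14_kb / engine_only_kb)          # ~2900x vs. the bare engine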


I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

This will also allow players to easily customize what they experience without changing the core game loop.


I wonder how overfit it is, though. You could fit a lot of DOOM-resolution JPEG frames into 4GB (the size of SD 1.4).


This is going to be the future of cloud gaming, isn't it? In order to deal with the latency, we just generate the next frame locally, and we'll have the true frame coming in later from the cloud, so we're never dreaming too far ahead of the actual game.


I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.


Congrats on running Doom on a Diffusion Model :D

I was really entranced by how combat is rendered (the grunt doing weird stuff, very much in the style in which the model generates images). Now I'd like to see this implemented as a shader in a game.


I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than learning from a large corpus of language like LLMs do. That would be a truly grounded language generator.


Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.


So… is it interactive? Playable? Or just generating a video of gameplay?


From the article: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

The demo is actual gameplay at ~20 FPS.


It confused me that their stated evaluations by humans are comparing video clips rather than evaluating game play.


Short clips are the only way a human will make any errors determining which is which.


More relevant is if by _playing_ it they couldn’t tell which is which.


They obviously can within seconds, so it wouldn't be a result. Being able to generate gameplay that looks right even if it doesn't play right is one step.


I saw a video a while ago where they restyled actual Doom footage with a diffusion technique so it looked like a jungle or anything you liked. Can't find it anymore, but it looked impressive.


This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.


Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.


Isn't that what Nvidia’s Ray Reconstruction and DLSS (frame generation and upscaler) are doing, more or less?


At a high level I guess so. I don't know enough about Ray Reconstruction (though the results are impressive), but I was thinking of something more drastic than DLSS. Diffusion models on static images can turn a cartoon into a photorealistic image. Doing something similar for a game, where a low-quality render is turned into something that would otherwise take seconds to render, seems qualitatively quite different from DLSS. In principle a model could fill in huge amounts of detail, like increasing the number of particles in a particle-based effect, adding shading/lighting effects...
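Something in that direction can already be prototyped offline with an off-the-shelf image-to-image pipeline; this is only a sketch using the Hugging Face diffusers library (model ID and parameters are illustrative, not from the paper), and doing it at game frame rates is the unsolved part:

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    low_quality_frame = Image.open("frame_low_quality.png").convert("RGB")

    # Low strength keeps the original geometry; the model only "re-renders" detail.
    enhanced = pipe(
        prompt="photorealistic ray-traced render of the same scene",
        image=low_quality_frame,
        strength=0.35,
        guidance_scale=7.5,
    ).images[0]
    enhanced.save("frame_enhanced.png")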


So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?


I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about it a couple of months ago, that probably this is the next modality worth exploring for data, but didn't imagine it would come so fast. On the other side, the amount of compute required is crazy.


Nvidia's CEO reckons your GPU will be replaced with AI in "5-10 years". So this is sort of the first working game of that kind, I guess.


I'd love to see John Carmack come back from his AGI hiatus and advance AI-based rendering. That would be super cool.


This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.


This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?


I am struggling to understand a single implication of this! How does this generalize to anything other than playing retro games in the most expensive way possible? The very intention of this project is overfitting to data in a non-generalizable way! Maybe the point is pure engineering: showing that good ANNs are getting cheap and fast. But this project still seems to have the fundamental weaknesses of all AI projects:

- needs a huge amount of data, which a priori precludes a lot of interesting use cases

- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)

- AI nailing something really complicated for humans (98% effective raycasting, 98% effective Python codegen) while failing to grasp abstract concepts rigorously understood by fish (object permanence, quantity)

I am genuinely struggling to see this as a meaningful step forward. It seems more like a World's Fair exhibit - a fun and impressive diversion, but probably not a vision of the future. Putting it another way: unlike AlphaGo, Deep Blue wasn't really a technological milestone so much as a sociological milestone reflecting the apex of a certain approach to AI. I think this DOOM project is in a similar vein.


I agree with you; when I made this comment I was simply excited, but that didn't last too long. I find this technology both exciting and dystopian, the latter because the dystopian use of it is already happening all over the internet. For now, it's been used only for entertainment AFAIK, which is a kind of use I don't like either, because I prefer human-created entertainment over this crap.


RL tetris effect hallucination.

Wish there were thousands of hours of Hardcore Henry footage to train on. Maybe scrape GoPro war cams.


What I want from something like this is a mix: a model that can infinitely "zoom" into an object's texture (even if it's not perfect, that would be fine), and a model that can create 3D geometry from bump maps / normals.


Video game streamers are next in line to be replaced by AI, I guess.


Jensen said that this is the future of gaming a few months ago fyi.


Thousands of different people have been speculating about this kind of thing for years.


Who is that?


I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.


Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.


>Am I the only one who thinks this is faked?

Yes.


Yes, and you can use an LLM to simulate role playing games.


This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.


AI no longer plays Doom; it is Doom.


looking forward to &/or wondering about overlap with notion of ray tracing LLMs


The gibs are a dead giveaway


Impressive. Imagine this but photorealistic, with VR goggles.


Wow, I bet Boston Dynamics and such are quite interested


Misleading Titles Are Everywhere These Days.


What is the point of this? It's hard to see how this is useful. Maybe it's just an exercise to show what a diffusion model can do?


Uhhh… demos would be more convincing with enemies and decreasing health


I see enemies and decreasing health on hit. But even if it lacked those, it seems like a pretty irrelevant nitpick that is completely underplaying what we're seeing here. The fact that this is even possible at all feels like science fiction.


So in the future we can play FPS games given any setting? Pog


What most programmers don't understand is that in the very near future the entire application will be delivered by an AI model: no source, no text, just connect to the app over RDP. The whole app will be created by example; the app developer will train the app like a dog trainer trains a dog.


I think it's possible AI models will generate dynamic UI for each client and stream the UI to clients (maybe eventually client devices will generate their UI on the fly), similar to Google Stadia. Maybe some offshoot of video streaming that allows the remote end to control it. Maybe Wasm-based: just stream Wasm bytecode around? The guy behind VLC is building a library for ultra low latency: https://www.kyber.video/techology.

I was playing around with the idea here: https://github.com/StreamUI/StreamUI. The thinking is to take the ideas of Elixir LiveView to the extreme.


I am so glad you posted, this is super cool!

I too have been thinking about how to push dynamic wasm to the client for super low latency UIs.

LiveView is just the beginning. Your readme is dreamy. I'll dive into your project at the end of Sept when I get back into deep tech.


So... https://websim.ai except over pixels instead of in your browser?


Yes, and that is super neat.


That might work for some applications, especially recreational ones, but I think we're a while away from it replacing everything, especially where deterministic behavior, efficiency, or reliability are important.


Problems for two papers down the line.



