All video games are, by definition, interactive videos.
What I imagine you're asking about is this: a typical game like Doom is effectively a function:
f(internal state, player input) -> (new frame, new internal state)
where internal state is the shape and look of the loaded map, plus the positions, behaviors and stats of enemies, the player, items, etc.
A typical AI that plays Doom, which is not what's happening here, is (at runtime):
f(last frame) -> new player input
and is attached in a loop to the previous case in the obvious way.
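To spell out "the obvious way", here is a minimal sketch in Python. Nothing in it is Doom: the state is a single integer, a frame is a string, and `game_step`, `agent` and `run` are made-up names for illustration only.

```python
from typing import Tuple

State = int   # toy stand-in for the full internal game state
Frame = str   # toy stand-in for a rendered image


def game_step(state: State, player_input: str) -> Tuple[Frame, State]:
    """Conventional game: f(internal state, player input) -> (new frame, new internal state)."""
    delta = {"left": -1, "right": 1}.get(player_input, 0)
    new_state = state + delta
    return f"player at x={new_state}", new_state


def agent(last_frame: Frame) -> str:
    """Game-playing AI: f(last frame) -> new player input. It reads the frame, never the state."""
    x = int(last_frame.split("=")[1])
    return "right" if x < 3 else "noop"


def run(steps: int) -> Tuple[Frame, State]:
    """Wire the two functions into a loop: frame -> input -> (frame, state) -> ..."""
    state = 0
    frame, state = game_step(state, "noop")  # produce an initial frame
    for _ in range(steps):
        player_input = agent(frame)
        frame, state = game_step(state, player_input)
    return frame, state
```

The point of the sketch is only the wiring: the game owns and mutates an explicit state, and the agent sits outside it, mapping frames to inputs.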
What we have here, however, is a game you can play, but implemented in a diffusion model, and it works like this:
f(player input, N last frames) -> new frame
Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it has, in a way, internalized most of the state and its evolution, so it can look at what's going on and guess what's about to happen. Which is what it does: it renders the next frame by predicting it, based on the current user input and the last N frames. And then that frame becomes part of the input for the next prediction, and so on, and so on.
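As a rough sketch of that loop in Python: `toy_predictor` below is a hypothetical stand-in for the diffusion model (it just adds the input to the latest frame), and `play` is an invented name. Only the shape of the loop is meant to match the description, not the actual model.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[int]  # toy stand-in for an image


def toy_predictor(player_input: int, last_frames: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in for the model: f(player input, N last frames) -> new frame."""
    return [pixel + player_input for pixel in last_frames[-1]]


def play(
    predict: Callable[[int, Sequence[Frame]], Frame],
    initial_frames: Sequence[Frame],
    inputs: Sequence[int],
    n: int,
) -> List[Frame]:
    """Run the autoregressive loop; the N-frame window is the only thing carried between steps."""
    context = deque(initial_frames, maxlen=n)
    shown = []
    for player_input in inputs:
        frame = predict(player_input, list(context))
        context.append(frame)  # the prediction becomes input; the oldest frame drops out
        shown.append(frame)
    return shown
```

Note there is no `state` variable anywhere: whatever is not visible in the last N frames has to be guessed by the predictor.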
So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.