Yes.

All video games are, by definition, interactive videos.

What I imagine you're really asking about is the following. A typical game like Doom is effectively a function:

  f(internal state, player input) -> (new frame, new internal state)

where the internal state is the shape and look of the loaded map, plus the positions, behaviors, and stats of enemies, the player, items, etc.

A typical AI that plays Doom, which is not what's happening here, is (at runtime):

  f(last frame) -> new player input

and is attached in a loop to the previous case in the obvious way.
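
A toy sketch of those two functions wired together, in Python. Everything here is a made-up stand-in rather than real Doom or a real agent: the "state" is a single integer, the "frame" is a string, and the names `game_step`, `agent`, and `run` are hypothetical.

```python
from typing import Tuple

State = int   # hypothetical stand-in for map, enemies, player stats, etc.
Frame = str   # hypothetical stand-in for a rendered image


def game_step(state: State, player_input: int) -> Tuple[Frame, State]:
    """f(internal state, player input) -> (new frame, new internal state)"""
    new_state = state + player_input       # the state is mutated explicitly
    frame = f"player at x={new_state}"     # the frame is rendered from the state
    return frame, new_state


def agent(last_frame: Frame) -> int:
    """f(last frame) -> new player input. Sees only the frame, never the state."""
    x = int(last_frame.split("=")[1])
    return 1 if x < 3 else 0               # walk right until x reaches 3


def run(steps: int) -> Tuple[Frame, State]:
    """Attach the agent to the game in a loop."""
    state = 0
    frame, state = game_step(state, 0)     # produce an initial frame
    for _ in range(steps):
        frame, state = game_step(state, agent(frame))
    return frame, state
```

The point of the sketch is only the data flow: the game owns the state, and the agent gets nothing but frames.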

What we have here, however, is a game you can play, but implemented as a diffusion model. It works like this:

  f(player input, N last frames) -> new frame

Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it has, in a way, internalized most of the state and how it evolves, so it can look at what's going on and guess what's about to happen. Which is exactly what it does: it renders the next frame by predicting it from the current user input and the last N frames. Then that frame becomes part of the input for the next prediction, and so on.
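
To make the feedback loop concrete, here is a rough Python sketch. It assumes nothing about the actual model: `toy_predict` is a hypothetical stub standing in for the trained network, frames are strings instead of images, and `play` is just a name I picked. Note that there is no state variable anywhere, only a rolling window of frames.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = str  # hypothetical stand-in for an image


def play(
    predict: Callable[[int, Sequence[Frame]], Frame],
    first_frames: Sequence[Frame],
    inputs: Sequence[int],
    n: int,
) -> List[Frame]:
    """f(player input, N last frames) -> new frame, applied in a loop."""
    context = deque(first_frames, maxlen=n)   # keeps only the N newest frames
    shown: List[Frame] = []
    for player_input in inputs:
        frame = predict(player_input, list(context))
        context.append(frame)                 # the output becomes future input
        shown.append(frame)
    return shown


def toy_predict(player_input: int, last_frames: Sequence[Frame]) -> Frame:
    # A real model would sample an image from a trained network. This stub
    # just recovers the "position" from the newest frame, which is the sense
    # in which the state lives implicitly in the frames.
    x = int(last_frames[-1].split("=")[1])
    return f"x={x + player_input}"
```

For example, `play(toy_predict, ["x=0"], [1, 1, 0, -1], n=4)` yields `["x=1", "x=2", "x=2", "x=1"]`. Anything that scrolls out of the N-frame window is simply gone, since there is no other place for it to be remembered.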

So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.