Could you explain how the interaction between the spatial autoencoder (ViT-based) and the latent diffusion backbone (DiT-based) enables both rapid response to real-time input and maintains temporal stability across long gameplay sequences? Specifically, how does dynamic noising integrate with these components to mitigate error compounding over time in an autoregressive setup?
reply