To train the inverse diffusion model, we take a clean image x0 and generate a noisy sample xt drawn from the distribution over points that x0 would visit after t steps of forward diffusion. For any value of t, any xt that x0 can visit is also visited by other clean images x0' when we run t steps of diffusion starting from those x0'. In general, there will be many such x0' for any xt our initial x0 might reach after t steps of forward diffusion.
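A minimal sketch of that training-time sampling step, assuming a standard DDPM-style Gaussian forward process where q(xt | x0) has a closed form; the linear beta schedule and the names `betas`, `alpha_bar`, `sample_xt` are illustrative choices, not taken from any particular implementation:

```python
import numpy as np

# Assumption: DDPM-style Gaussian diffusion, so
# q(x_t | x_0) = N( sqrt(abar_t) * x_0, (1 - abar_t) * I ).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # per-step noise variances (illustrative linear schedule)
alpha_bar = np.cumprod(1.0 - betas)      # abar_t = prod_{s<=t} (1 - beta_s)

def sample_xt(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, without simulating t individual steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # stand-in for a clean image scaled to [-1, 1]
xt = sample_xt(x0, t=250, rng=rng)              # many different x0 can land on this same xt
```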
If t is small and the noise schedule adds only a little noise at each step, then the inverse conditional p(x0 | xt) we want to learn will be approximately a unimodal Gaussian. This is an intrinsic property of the forward diffusion process. When t is large, or the schedule adds a lot of noise at each step, the conditional p(x0 | xt) becomes more complex and places significant probability mass on a larger fraction of the images in the training set.
"If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50." -- there are actually models which deterministically (approximately) integrate the reverse diffusion process SDE and don't involve any random sampling aside from the initial xT during generation.
For example, if t=T, where T is the total length of the diffusion process, then xt=xT is effectively a Gaussian sample carrying no information about x0, and the inverse conditional p(x0 | xT) is simply p(x0), the distribution of the training data. In general, p(x0) is not a unimodal isotropic Gaussian. If it were, we could model our training set just by fitting the mean and (diagonal) covariance matrix of a Gaussian distribution.
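For illustration, this toy sketch is all that "just fit a Gaussian" would amount to; for real image data the samples it produces look like blurry noise around the dataset mean, which is exactly why p(x0) needs a richer model.

```python
import numpy as np

def fit_diag_gaussian(train_x):
    """train_x: (N, D) flattened training images; returns per-pixel mean and variance."""
    mu = train_x.mean(axis=0)
    var = train_x.var(axis=0) + 1e-8     # diagonal covariance entries
    return mu, var

def sample_diag_gaussian(mu, var, n, rng):
    """Draw n samples from N(mu, diag(var)) -- a very poor model of natural images."""
    return mu + np.sqrt(var) * rng.standard_normal((n, mu.shape[0]))
```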
"I'm really not sure about that... :/" -- the forward diffusion process initiated from x0 iteratively removes information about the starting point x0. Depending on the noise schedule, the rate at which information about x0 is removed by the addition of noise can vary. Whether we're in the continuous or discrete setting, this means the inverse conditional p(x0 | xt) will increase in entropy as t goes from 1 to T, where T is the max number of diffusion steps. So, when we generate an image by running the inverse diffusion process the conditional p(x0 | xt) will have shrinking entropy as t is now decreasing.
The quote (1) you reference is about how directly optimizing the noise schedule is more challenging when working with discrete inputs/latents. Whether the noise schedule is trained, as in their continuous case, or defined a priori, as in their discrete case, each step of forward diffusion removes information about the input, and what I said about the shrinking entropy of the conditional p(x0 | xt) as we run reverse diffusion still holds. In current SOTA diffusion models, I believe the noise schedules are set via hyperparameter search rather than optimized by SGD/ADAM/etc.
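As a concrete example of a schedule that is defined a priori rather than learned, here is a sketch in the style of the cosine schedule from the Improved DDPM paper (Nichol & Dhariwal, 2021); the offset `s` is exactly the kind of hand-tuned hyperparameter I mean, chosen by search rather than by gradient descent.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """abar_t = f(t)/f(0) with f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2); s is hand-chosen."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]                                    # abar_0 = 1 by construction

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas; clipping keeps the final steps well-behaved."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

alpha_bar = cosine_alpha_bar(T=1000)
betas = betas_from_alpha_bar(alpha_bar)
```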