I assume they start with the "signal" and "noise" as separate audio files and mix them together to create a synthetic noisy input. Then they can train the model's output against only the clean signal, so that it learns to filter out the noise.
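As a rough sketch of what that mixing step might look like (this is my guess at the general idea, not their actual pipeline): the helper names `mean_power` and `mix_at_snr`, the 16 kHz sample rate, and the 5 dB target are all made up for illustration, and the sine wave and random samples stand in for the real "signal" and "noise" files.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mix has the requested SNR in dB."""
    # Loop the noise clip so it covers the whole signal (assumed behaviour).
    looped = [noise[i % len(noise)] for i in range(len(signal))]
    # Pick gain so that P_signal / (gain^2 * P_noise) == 10^(snr_db / 10).
    gain = math.sqrt(
        mean_power(signal) / (mean_power(looped) * 10 ** (snr_db / 10))
    )
    return [s + gain * n for s, n in zip(signal, looped)]


random.seed(0)
sample_rate = 16000  # hypothetical sample rate
# Stand-in for the clean "signal" file: one second of a 440 Hz tone.
clean = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(sample_rate)]
# Stand-in for the "noise" file: half a second of Gaussian noise.
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate // 2)]

noisy = mix_at_snr(clean, noise, snr_db=5.0)

# One training pair: the model sees `noisy` and is scored against `clean`.
training_pair = (noisy, clean)
```

Mixing at a range of SNR values and with many different noise clips would give the model more varied examples, but the training target stays the same clean signal each time.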