See Vanderkooy and Lipshitz 1987 for why.
What dithering does is decorrelate the quantization error from the signal. Absent it, quantization generates harmonic spurs. In theory, on a very clean, noiseless signal those spurs might be more audible than you'd expect from the overall quantization noise level.
In practice, 16 bits is enough precision that these harmonics are inaudible even in fairly pathological cases. But dithering eliminates the potential problem entirely by replacing the harmonic content with white noise.
Adding noise on playback just adds noise; it would not remove the harmonics, which are already baked into the quantized signal. The dither has to go in before quantization.
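Here's a minimal sketch of the effect, assuming Python/numpy (the 8-bit depth is exaggerated just to make the spurs easy to see):

```python
import numpy as np

fs = 48_000
t = np.arange(fs) / fs
signal = 0.5 * np.sin(2 * np.pi * 1000 * t)   # a clean 1 kHz tone

bits = 8                                      # coarse on purpose
lsb = 2.0 / 2**bits                           # step size for a [-1, 1) range

def quantize(x):
    return np.round(x / lsb) * lsb

plain = quantize(signal)                      # error correlates with the signal
dither = np.random.uniform(-lsb / 2, lsb / 2, signal.shape)
dithered = quantize(signal + dither)          # error decorrelates into noise

# The undithered error spectrum is spiky (harmonics of 1 kHz); the
# dithered one is a roughly flat noise floor.
for name, y in (("undithered", plain), ("dithered", dithered)):
    spec = np.abs(np.fft.rfft(y - signal))
    print(f"{name}: peak/mean of error spectrum = {spec.max() / spec.mean():.0f}")
```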
The _best_ kind of dithering scheme is subtractive dither, where noise is added before quantization and then the _same_ noise is subtracted from the dequantized signal on playback. This is best in the sense that it completely eliminates the distortion with the least added noise power. But it's never used for audio applications, due to the complexity of managing the synchronized noise on each side.
Mersenne Twister with a shared seed in the metadata?
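Something like that could work in principle. A minimal sketch, assuming numpy (whose legacy `RandomState` generator is MT19937, i.e. a Mersenne Twister); the seed value here is a hypothetical stand-in for whatever the metadata would carry:

```python
import numpy as np

LSB = 2.0 / 2**16     # quantization step for 16-bit samples in [-1, 1)
SEED = 12345          # hypothetical: shipped in the file's metadata

def encode(signal, seed=SEED):
    # Encoder side: add seeded noise, then quantize to integer codes.
    noise = np.random.RandomState(seed).uniform(-LSB / 2, LSB / 2, len(signal))
    return np.round((signal + noise) / LSB).astype(np.int32)

def decode(codes, seed=SEED):
    # Decoder side: dequantize, then subtract the *same* noise,
    # regenerated from the shared seed.
    noise = np.random.RandomState(seed).uniform(-LSB / 2, LSB / 2, len(codes))
    return codes * LSB - noise

x = 0.3 * np.sin(2 * np.pi * 440 * np.arange(4800) / 48_000)
err = decode(encode(x)) - x
print(np.abs(err).max())   # about half an LSB, with no harmonic structure
```

The catch is the one noted above: both sides must stay sample-aligned, so every seek, edit, or splice has to re-derive the correct position in the noise stream.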
Now suppose we add noise to our signal before we quantize it. A given pixel at 25% gray (which under the previous scheme would always end up solid black) now has a 25% chance of ending up white. A contiguous block of such pixels will have an average value of 25% gray, even though an individual pixel can only be black or white. Thus, by flip-flopping between the two closest values ("dithering") in statistical proportion to the original signal, information is preserved.
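A minimal sketch of that pixel example, assuming numpy and uniform noise one quantization step wide:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = np.full((64, 64), 0.25)   # a block of 25%-gray pixels

# Undithered 1-bit quantization: every pixel falls below the 0.5
# threshold, so the whole block comes out solid black.
hard = (patch >= 0.5).astype(float)

# Dithered: add uniform noise one step wide before thresholding. A
# pixel now goes white exactly when its noise draw exceeds 0.25,
# i.e. with 25% probability.
soft = (patch + rng.uniform(-0.5, 0.5, patch.shape) >= 0.5).astype(float)

print(hard.mean())   # 0.0   -- the gray level is lost
print(soft.mean())   # ~0.25 -- preserved as the fraction of white pixels
```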
In audio, if I recall correctly, it's also important to avoid obvious noise modulation: a noise floor that audibly rises and falls with the signal.
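If I'm remembering the literature right, the usual fix is triangular-PDF (TPDF) dither, the sum of two independent uniform draws, which makes the total error power independent of the signal level. A sketch, assuming numpy:

```python
import numpy as np

LSB = 2.0 / 2**16
rng = np.random.default_rng()

def tpdf_quantize(signal):
    # Two independent uniform draws sum to a triangular distribution,
    # 2 LSB peak to peak: wide enough that the error variance no longer
    # depends on where the signal sits within a quantization step.
    dither = rng.uniform(-LSB / 2, LSB / 2, len(signal)) + \
             rng.uniform(-LSB / 2, LSB / 2, len(signal))
    return np.round((signal + dither) / LSB) * LSB

y = tpdf_quantize(0.3 * np.sin(2 * np.pi * np.arange(480) / 48))
```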