When you train your neural network to minimise cross-entropy, that's literally the same as making it a better building block for an arithmetic-coding data compressor. See https://en.wikipedia.org/wiki/Arithmetic_coding
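A rough sketch of the equivalence in Python (the toy sequence and model probabilities are made up purely for illustration):

    import math

    # Toy "model": the probability the model assigned to each observed symbol,
    # as any next-symbol predictor (e.g. a language model) would output.
    sequence = ["a", "b", "a", "c"]
    predicted_probs = [0.5, 0.2, 0.6, 0.1]  # P(symbol | context) from the model

    # Cross-entropy loss (in bits) that training would minimise:
    cross_entropy_bits = -sum(math.log2(p) for p in predicted_probs)

    # An ideal arithmetic coder driven by the same model spends -log2(p) bits
    # per symbol, so the compressed size is this same number (plus at most
    # ~2 bits of overhead for the whole message).
    print(cross_entropy_bits)  # ~7.38 bits to encode the 4 symbols

So lowering the training loss and shrinking the compressed output are the same objective, just measured in bits.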
Indeed, the KL divergence can be seen as the difference between the average number of bits required to arithmetically encode a sample from a distribution when using symbol probabilities from an approximating distribution versus from the original distribution itself.
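A quick numeric check of that reading (the two distributions here are arbitrary examples):

    import math

    P = [0.5, 0.25, 0.25]   # original symbol distribution
    Q = [0.4, 0.4, 0.2]     # approximating (model) distribution

    # Average bits per symbol for a sample from P, coded with Q's vs P's probabilities:
    cost_with_Q = -sum(p * math.log2(q) for p, q in zip(P, Q))  # H(P, Q)
    cost_with_P = -sum(p * math.log2(p) for p in P)             # H(P)

    kl = sum(p * math.log2(p / q) for p, q in zip(P, Q))
    print(kl, cost_with_Q - cost_with_P)  # both ~0.07 bits per symbol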
I encountered this more than 10 years ago, and it felt novel that compression is related to intelligence and even AGI.