I think you may want to look more closely at Shannon’s source coding theorem: the Shannon entropy of a compression algorithm’s output will be higher than the entropy of the source, since identifiable patterns are eliminated. Otherwise the theorem would trivially contradict itself.
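A quick way to see this empirically (a rough sketch of my own, using DEFLATE and per-byte frequencies; exact numbers will vary with the input):

    # Sketch (mine, not from the thread): compare the empirical per-byte entropy
    # of a highly patterned source with that of its DEFLATE-compressed output.
    import math
    import zlib
    from collections import Counter

    def byte_entropy(data):
        """Empirical Shannon entropy in bits per byte."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    source = b"ABAB" * 10000 + b"the quick brown fox " * 500  # very patterned
    compressed = zlib.compress(source, 9)

    print("source:     %.3f bits/byte" % byte_entropy(source))
    print("compressed: %.3f bits/byte" % byte_entropy(compressed))
    # The compressed output is far less patterned: its per-byte empirical entropy
    # comes out well above the source's, even though it carries no more
    # information in total.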



Shannon's source coding theorem says that the entropy of the compressed information is at most the entropy of the uncompressed information. If you add entropy to the compressed data, you are by definition adding noise and degrading the SNR of the signal.
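To be precise about which statement I mean: for a memoryless source X and an optimal symbol code C (notation mine, entropy in bits per symbol, as I understand the usual textbook form),

    H(X) \le \mathbb{E}[\ell(C(X))] < H(X) + 1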


We’re not talking about a noisy channel here, so I’m not sure where the SNR comes into it. I think we’re talking about the entropy of different distributions, so let’s cut to a concrete example relevant to your original claim (that compression doesn’t help reduce the impact of repeated blocks in ECB by reducing the rate of repeated blocks).

Suppose we have some string of bytes. When we split it into aligned 16-byte blocks (let’s assume it divides evenly for simplicity), we find that these blocks are not evenly distributed. For example, 1% of the blocks turn out to be the same, which, given the number of possible symbols in this code, is massively out of proportion.
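To make the ECB connection concrete, here is a rough sketch of my own (it assumes a recent pyca/cryptography install, and the toy plaintext just approximates the 1% figure):

    # Hypothetical demo: under AES-ECB, equal 16-byte plaintext blocks encrypt to
    # equal ciphertext blocks, so repeats in the plaintext stay visible.
    import os
    from collections import Counter
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def blocks(data, size=16):
        return [data[i:i + size] for i in range(0, len(data), size)]

    # ~1000 blocks, 10 of which (about 1%) are the same block.
    plaintext = os.urandom(16 * 990) + (b"\x00" * 16) * 10

    key = os.urandom(16)
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    ciphertext = encryptor.update(plaintext) + encryptor.finalize()

    pt_dups = sum(c - 1 for c in Counter(blocks(plaintext)).values() if c > 1)
    ct_dups = sum(c - 1 for c in Counter(blocks(ciphertext)).values() if c > 1)
    print("duplicate plaintext blocks: ", pt_dups)  # 9
    print("duplicate ciphertext blocks:", ct_dups)  # also 9: ECB leaks the repeats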

We apply a Huffman code using the 16-byte blocks present in the message as the alphabet and their observed frequencies in this particular message (if that aspect bothers you, you can assume we prepend the dictionary to the message). Huffman codes are optimal for per-symbol encoding.

Suppose we re-evaluate the distribution of 16-byte blocks in the compressed data; will this distribution have higher entropy (meaning there will be fewer duplicate blocks to exploit ECB with) or not?
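For anyone who would rather check than argue, here is a rough sketch of the experiment (my own code, nothing canonical; the toy message, the Huffman implementation, and the naive bit-packing are all just stand-ins):

    # Sketch: Huffman-code a message using its aligned 16-byte blocks as symbols,
    # then re-measure how often 16-byte blocks repeat in the packed output.
    import heapq
    import itertools
    import os
    from collections import Counter

    def huffman_code(freqs):
        """Return {symbol: bitstring} for a Huffman code over the given frequencies."""
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = itertools.count(len(heap))
        if len(heap) == 1:                      # degenerate single-symbol alphabet
            (_, _, table), = heap
            return {sym: "0" for sym in table}
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in t1.items()}
            merged.update({s: "1" + c for s, c in t2.items()})
            heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
        return heap[0][2]

    def blocks(data, size=16):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def duplicate_block_fraction(data, size=16):
        counts = Counter(blocks(data, size))
        dups = sum(c for c in counts.values() if c > 1)
        return dups / max(1, len(blocks(data, size)))

    # Toy message: ~1000 blocks, 10 of which (1%) are the same block.
    msg = os.urandom(16 * 990) + (b"\x00" * 16) * 10

    code = huffman_code(Counter(blocks(msg)))
    bits = "".join(code[b] for b in blocks(msg))
    # Pack the bitstring into bytes so we can talk about 16-byte blocks of the
    # compressed data.
    packed = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")

    print("plaintext duplicate-block fraction: ", duplicate_block_fraction(msg))
    print("compressed duplicate-block fraction:", duplicate_block_fraction(packed))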



