I've always favored this down-to-earth characterization of the entropy of a discrete probability distribution. (I'm a big fan of John Baez's writing, but I was surprised glancing through the PDF to find that he doesn't seem to mention this viewpoint.)
Think of the distribution as a histogram over some bins. The entropy then answers this question: if I throw many, many balls at random into those bins, how likely is it that the distribution of balls over bins ends up looking like that histogram? What you usually expect to see is a uniform distribution of balls over bins, so the entropy measures the probability of other, rare outcomes (in the language of probability theory, "large deviations" from that typical behavior).
More specifically, if P = (P1, ..., Pk) is some distribution, then the probability that throwing N balls (for N very large) gives a histogram looking like P is about 2^(-N * [log(k) - H(P)]), where H(P) is the entropy and logs are base 2. When P is the uniform distribution, H(P) = log(k), the exponent is zero, and the estimate is 1 (to leading exponential order), which says that by far the most likely histogram is the uniform one. That is the largest possible entropy, so any other histogram has probability about 2^(-c*N) of appearing for some c > 0. In other words, it is very unlikely, and exponentially more so the more balls we throw, and the entropy measures just how much. "Less uniform" distributions are less likely, so the entropy also measures a certain notion of uniformity. In large deviations theory this specific claim is called "Sanov's theorem" and the role the entropy plays is that of a "rate function."
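A quick numerical sanity check of that estimate, as a sketch rather than anything definitive. The helper names `entropy_bits` and `exact_exponent` and the example distribution are my own choices, not from any library. It computes the exact probability of hitting the counts N*P_i exactly (a narrower event than "looks like P", but one with the same exponential rate) and compares the per-ball exponent with log(k) - H(P):

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def exact_exponent(p, n):
    """-(1/n) * log2 of the exact probability that n uniform throws into
    len(p) bins produce exactly the counts n * p[i]."""
    counts = [round(n * x) for x in p]
    assert sum(counts) == n, "n * p[i] should be whole numbers"
    # ln of the multinomial coefficient n! / (c_1! * ... * c_k!)
    ln_multinomial = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # divide by k^n, the total number of equally likely outcomes
    log2_prob = ln_multinomial / math.log(2) - n * math.log2(len(p))
    return -log2_prob / n

p = (0.5, 0.25, 0.25)                        # illustrative target histogram
rate = math.log2(len(p)) - entropy_bits(p)   # log(k) - H(P), in bits

for n in (40, 400, 4000, 40000):
    print(n, round(exact_exponent(p, n), 5), "vs", round(rate, 5))
```

The two columns should get closer as N grows. The leftover gap shrinks like log(N)/N, which is the polynomial prefactor that the "about" in the estimate hides.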
The counting interpretation of entropy that some people are talking about is related, at least at a high level, because the probability in Sanov's theorem is the number of outcomes that "look like P" divided by the total number, so the numerator there is indeed counting the number of configurations (in this case of balls and bins) having a particular property (in this case looking like P).
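To spell out that counting step (a sketch, assuming for simplicity that every N*P_i is a whole number and every P_i > 0): the numerator is a multinomial coefficient, the denominator is k^N, and Stirling's formula gives the exponent.

```latex
\[
\Pr\bigl[\text{counts} = (NP_1,\dots,NP_k)\bigr]
  = \frac{1}{k^N}\binom{N}{NP_1,\dots,NP_k},
\qquad
\binom{N}{NP_1,\dots,NP_k} = 2^{\,N H(P) + O(\log N)},
\]
\[
\text{so}\quad
\Pr\bigl[\text{counts} = (NP_1,\dots,NP_k)\bigr]
  = 2^{-N\,[\log_2 k - H(P)] + O(\log N)} .
\]
```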
There are lots of equivalent definitions and they have different virtues, generalizations, etc, but I find this one especially helpful for dispelling the air of mystery around entropy.
Hey, did you want to say relative entropy ~ rate function ~ KL divergence? That might be more familiar to the ML enthusiasts here and get them curious about Sanov or large deviations.
That's right: here log(k) - H(P) is really the relative entropy (or KL divergence) between P and the uniform distribution, and all the same stuff is true for a different "reference distribution" giving the probabilities of balls landing in each bin.
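Written out (with Q denoting the reference distribution, a symbol I'm introducing here, and logs base 2), the identity is a one-line computation:

```latex
\[
D(P \,\|\, Q) = \sum_{i=1}^{k} P_i \log_2 \frac{P_i}{Q_i},
\qquad
Q_i = \tfrac{1}{k}
\;\Longrightarrow\;
D(P \,\|\, Q) = \sum_{i=1}^{k} P_i \log_2 P_i + \log_2 k = \log_2 k - H(P).
\]
```

For a general reference distribution Q, the estimate becomes 2^(-N * D(P||Q)).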
For discrete distributions the "absolute entropy" (just the sum of -p log(p), as it shows up in Shannon entropy or statistical mechanics) is in this way really a special case of relative entropy. For continuous distributions, say over real numbers, the analogous quantity (the integral of -p log(p)) isn't a relative entropy, since there's no "uniform distribution over all real numbers". This still plays an important role in various situations and calculations... but, at least to my mind, it's a formally similar but conceptually separate object.