
A Friendly Introduction to Cross-Entropy Loss (2016) - mwulf
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
======
happy4crazy
I personally don't find the "bits"y explanation of entropy/cross-entropy/KL
etc. to be all that intuitive; as fundamental as it may be, I don't think
about compression/encodings all that often. I've always preferred the
"surprise" interpretation:
[http://charlesfrye.github.io/stats/2016/03/29/info-theory-su...](http://charlesfrye.github.io/stats/2016/03/29/info-theory-surprise-entropy.html)

In short: given some event of probability p, -log p = log 1/p is its
"surprise". (If p = 1, log 1/1 = 0, so zero surprise; as p -> 0, the surprise
gets bigger and bigger; and the surprise for two independent events, p = p1 *
p2, is the sum of their individual surprises: log 1/(p1*p2) = log 1/p1 + log
1/p2.)

The entropy of a distribution is its average surprise: Sum/Integral of p log
1/p.
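
In Python, a rough numpy sketch of the above (base-2 logs, so surprise and entropy come out in bits; the distributions are just illustrative):

    import numpy as np

    def surprise(p):
        # Surprise of an event with probability p: log2(1/p), measured in bits.
        return -np.log2(p)

    def entropy(dist):
        # Entropy of a distribution: its average surprise, sum of p * log2(1/p).
        dist = np.asarray(dist)
        return np.sum(dist * surprise(dist))

    # Surprise adds for independent events: log 1/(p1*p2) = log 1/p1 + log 1/p2.
    p1, p2 = 0.5, 0.25
    print(surprise(p1 * p2), surprise(p1) + surprise(p2))  # 3.0 and 3.0

    print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally surprising
    print(entropy([0.99, 0.01]))  # ~0.08 bits: a heavily biased coin rarely surprises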

KL(p || q) is your excess surprise if you think something's distribution is q
but it's actually p: Sum/Integral p (log 1/q - log 1/p). The KL divergence is
always non-negative because surely if you think the distribution is q but it's
actually p, on average you're going to be more surprised than someone who
knows it's p.
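
And the KL divergence in the same vein (again just a sketch; kl_divergence is my own name for the helper):

    import numpy as np

    def kl_divergence(p, q):
        # KL(p || q): expected excess surprise from believing q when the truth is p.
        p, q = np.asarray(p), np.asarray(q)
        return np.sum(p * (np.log2(1 / q) - np.log2(1 / p)))

    p = [0.8, 0.2]  # true distribution
    q = [0.5, 0.5]  # assumed distribution
    print(kl_divergence(p, q))  # ~0.28 bits of excess surprise
    print(kl_divergence(p, p))  # 0.0: no excess surprise when your belief is right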

~~~
heavenlyblue
If you're introducing a new term solely for the sake of explaining something,
then your fundamentals are wrong.

Bits are fundamental to understanding why we can encode simple numbers in a
GNN. If you don't understand that, then surprise-surprise: you need to create
another framework, one that may be misleading further down the line.

------
rerx
The cross entropy can be understood as the expected number of bits you need to
specify an event drawn from the true distribution when you encode it under an
assumed distribution. It is large if your assumed distribution is very
different from the true distribution. This is indeed helpful intuition for
making sense of formulas in machine learning.
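
For example, a quick sketch (my own helper; base-2 logs, so the result is in bits): with true distribution p and assumed distribution q, the cross entropy is the sum of p * log2(1/q).

    import numpy as np

    def cross_entropy(p, q):
        # Expected bits to specify an event drawn from p when the code assumes q.
        p, q = np.asarray(p), np.asarray(q)
        return np.sum(p * np.log2(1 / q))

    p = [0.8, 0.2]                       # true distribution
    print(cross_entropy(p, p))           # ~0.72 bits: best case, assumption matches truth
    print(cross_entropy(p, [0.5, 0.5]))  # 1.0 bit
    print(cross_entropy(p, [0.1, 0.9]))  # ~2.69 bits: a far-off assumption costs more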

------
stared
As a big fan of entropy (see this post:
[https://www.reddit.com/r/MachineLearning/comments/8im9eb/d_c...](https://www.reddit.com/r/MachineLearning/comments/8im9eb/d_crossentropy_vs_meansquared_error_loss/)
and links there) I like this explanation:

[https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-...](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

------
dangom
Readers may be interested in checking out this blog post:
[http://colah.github.io/posts/2015-09-Visual-Information/](http://colah.github.io/posts/2015-09-Visual-Information/)

