
What is the Kullback-Leibler divergence? - rgbimbochamp
https://saru.science/tech/2018/02/15/kl-divergence-explanation.html
======
ssivark
To summarize succinctly, KL(q||p) quantifies how badly you screw up if the
true distribution is “q” and you instead think it is “p”.

Note that KL divergence is not symmetric! E.g., if the true distribution of
coin tosses is 100% heads and your model says 50/50, you won't mess up too
badly, compared with the case where the true coin is 50/50 and your model is
100% heads (and you would have been willing to bet a LOT of money that there
will be no tails in the outcome).
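
A quick numerical version of this example, assuming NumPy; the 100%-heads
model is nudged to 1 - 1e-12 to keep log(0) out of the picture:

    import numpy as np

    def kl(q, p):
        # KL(q || p): expected extra surprise when the truth is q but you model it as p
        return np.sum(q * np.log2(q / p))

    fair  = np.array([0.5, 0.5])              # [heads, tails]
    heads = np.array([1 - 1e-12, 1e-12])      # "100% heads", nudged to avoid log(0)

    print(kl(heads, fair))   # ~1 bit: true coin is all heads, model says 50/50
    print(kl(fair, heads))   # ~18.9 bits, and -> infinity as the model's tail prob -> 0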

In this technical sense, it is preferable to be conservative rather than
overly confident.

~~~
cultus
As a side note, KL divergence is actually symmetric to the second order: if
you have a distribution "p" parameterized by x, the divergence between p|x_0
and p|x_1 for a nearby value x_1 is approximately symmetric.

This is useful because the Hessian of the KL divergence (if it exists) with
respect to the parameters of P defines a Riemannian metric called the Fisher
information metric. This provides a good distance measure that takes into
account how much the information content, or entropy, changes as you move
around in parameter space.

This is really useful for fast online variational Bayesian methods. Gradient
descent in the Euclidean space of the parameters can be pretty lousy, but the
"natural gradient", which uses the Fisher information metric, gives a more
natural definition of distance.

The Fisher information can be derived more generally if the Hessian doesn't
exist: it is also the variance of the gradient of the log-likelihood.
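
A minimal numerical sketch of both claims, assuming NumPy and using a
Bernoulli(theta) model as the example: the variance of the score matches the
Fisher information, and the KL divergence between nearby parameter values is
approximately 0.5 * I(theta) * delta^2 in either direction.

    import numpy as np

    def kl_bernoulli(a, b):
        # KL( Bernoulli(a) || Bernoulli(b) ), in nats
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    theta, delta = 0.3, 1e-3

    # Fisher information of Bernoulli(theta) as the variance of the score
    # d/dtheta log p(x; theta) under x ~ Bernoulli(theta); E[score] = 0.
    score = np.array([-1 / (1 - theta), 1 / theta])   # score at x = 0 and x = 1
    probs = np.array([1 - theta, theta])
    fisher = np.sum(probs * score ** 2)
    print(fisher, 1 / (theta * (1 - theta)))          # both ~4.7619

    # Second-order symmetry: both directions match 0.5 * I(theta) * delta^2
    print(kl_bernoulli(theta, theta + delta))         # ~2.4e-06
    print(kl_bernoulli(theta + delta, theta))         # ~2.4e-06
    print(0.5 * fisher * delta ** 2)                  # ~2.4e-06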

~~~
srean
> As a side note, KL divergence is actually symmetric to the second order:

This is backwards. It's a tautology.

Any function, no matter how egregiously asymmetric, is locally symmetric if
it is twice differentiable at that point. This is so by construction: you are
approximating it locally by the best possible quadratic [hence locally
symmetric] curve [surface].

Hence, the claim about symmetry is not false, but it is vacuous. Much like
the claim that the equation of a French curve is such that, no matter how you
turn it, at its highest point it is tangent to the horizontal.

That said, the Fisher information metric does have many uses.

------
Patient0
I've recently discovered this excellent lecture series by David MacKay
available on YouTube:
[https://youtu.be/y5VdtQSqiAI](https://youtu.be/y5VdtQSqiAI)

He also wrote the accompanying textbook, which is available for free download:
[http://www.inference.phy.cam.ac.uk/itprnn/book.pdf](http://www.inference.phy.cam.ac.uk/itprnn/book.pdf)

I was really impressed by these lectures, and was dismayed to learn that he
died from cancer a couple of years ago.

------
beagle3
I wish information theory were part of the math/CS/engineering curriculum in
more places.

The basics are fundamental to many areas of science (especially if they touch
probability in any way), intuitive, and mostly accessible with just a couple
of handwaves.

~~~
saiya-jin
We had it at our university, actually in quite some depth. It was taught by
the head of the IT department at our faculty, a long-retired guy who was
supposedly brilliant as a theoretical scientist and had a high reputation all
over Europe in his field.

It was taught in the most horrible and unmotivating way: an A4 page or two
densely covered with Greek letters and then some, and 98% of the content was
just proofs of relatively simple statements. On all tests/exams, only the
proofs were tested (so for each question you either produced 1-2 pages of a
single proof or a blank page, and could effectively go home as failed).

Subjectively it was the worst set of classes during the whole 5 years (and we
had some serious IT-unrelated crap because we were part of the
electrical-engineering faculty back then), completely mandatory, with no
credit system at the time to make it up via something else. Out of 100 people
in the 3rd and 4th year, by that point focused entirely on software
engineering, maybe 2-3 had a proper clue and could do the stuff off the top of
their head.

Needless to say, most of the people who got thrown out of the university
failed exactly these courses, and quite a few of them were brilliant coders
who were very successful afterwards. They just couldn't be bothered with the
bad approach this guy took.

It is a very important topic, but it should be taught in a sane way. This guy
couldn't do it; he alienated every single student from the topic for years to
come (even the few who got it all), and nobody at the school dared to
challenge him and his methods.

~~~
sn41
I actually sympathize with the theoretician (disclaimer: I work in
information-theoretic areas). Information theory is easy to motivate at a
first cut, but if you want to really understand it, then there are some hairy
issues. There is a lot of slip between the cup and the lip when it comes to
information theory (Shannon himself made several serious errors in his
original 1948 paper which took decades to fully work out).

Many seemingly "obvious" facts in information theory are tricky to show. Some
examples:

(1) From the article: cross entropy is always greater than or equal to
entropy, since we are coding with the wrong distribution. How do you show
this? For any two probability vectors (p,q), can we say H(p,q) >= H(p)? Any
proof I know involves some delicate usage of Jensen's inequality. (By the way,
I feel that the notation used by the author is non-standard. H(p,q) usually
stands for the joint entropy, which is quite different.)

(2) Another famous fact about entropy : conditioning always reduces entropy -
for any two random variables X, Y, we have H(X|Y) <= H(X) and H(Y|X) <= H(Y).
This is called Shannon's inequality, and the proof involves a subtle trick.

(3) You can easily show that if p=q, then KL(p||q)=0. But it is also true that
if KL(p||q)=0, then p=q. The second fact is quite tricky, and used to appear
as a question in Ph.D qualifying exams.
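
A quick numerical sanity check (not a proof) of (1) and (3), assuming NumPy
and using the article-style notation where H(p, q) is the cross entropy:

    import numpy as np

    def H(p):               # entropy, in bits
        return -np.sum(p * np.log2(p))

    def cross_H(p, q):      # cross entropy, H(p, q) in the article's notation
        return -np.sum(p * np.log2(q))

    def KL(p, q):
        return cross_H(p, q) - H(p)

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))

    print(cross_H(p, q) >= H(p))   # True: coding with the wrong distribution costs extra bits
    print(KL(p, q) > 0)            # True here, and it is 0 only when p == q
    print(KL(p, p))                # 0.0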

~~~
FiatLuxDave
I have read Shannon's 1948 paper a few times, and while I have noticed one or
two errors, I highly doubt that I know all of them. Is there any chance you
could point me in the right direction to research the several serious errors
you mention? I'd greatly appreciate the pointers.

~~~
sn41
A good introductory lecture is by Emre Telatar of EPFL. It's a great talk, and
presents a unique view on Shannon's paper. [Of course, I am assuming that
_you_ are not Emre :) ] It mentions some of the errors in Shannon's paper:

[https://www.youtube.com/watch?v=9FlHZwEpvPE&feature=youtu.be](https://www.youtube.com/watch?v=9FlHZwEpvPE&feature=youtu.be)

There are some more errors in his formulation of what eventually came to be
known as the Shannon-McMillan-Breiman theorem. These are the errors I know of,
there may be more.

The greatness of the paper is its revolutionary conception of a new area ab
initio. It contains errors, but that is overshadowed by what it achieved and
brought forth.

~~~
FiatLuxDave
Thank you!

------
atrudeau
Shannon's dissertation is a great introduction (:p) to entropy.
[https://dspace.mit.edu/handle/1721.1/11173](https://dspace.mit.edu/handle/1721.1/11173)

------
cryptonector
This divergence feels a lot like building a Huffman encoding table from a
predicted probability distribution, then measuring how efficient it turns out
to be compared with a Huffman table built from the probability distribution
you actually observe in the real data after the fact.
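
A rough sketch of that comparison, assuming Python's heapq and a made-up
four-symbol alphabet. Huffman codeword lengths are whole bits, so the gap
between the two tables only approximates KL(p||q), the exact penalty for
ideal, fractional-length codes:

    import heapq

    def huffman_lengths(probs):
        # Codeword lengths for a Huffman code built from {symbol: probability}.
        heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in probs}
        tie = len(heap)
        while len(heap) > 1:
            p1, _, syms1 = heapq.heappop(heap)
            p2, _, syms2 = heapq.heappop(heap)
            for s in syms1 + syms2:
                lengths[s] += 1        # each merge pushes these symbols one level deeper
            heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
            tie += 1
        return lengths

    p = {'a': 0.6, 'b': 0.25, 'c': 0.1, 'd': 0.05}    # real data, after the fact
    q = {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}  # the prediction

    code_p = huffman_lengths(p)   # table built from the real distribution
    code_q = huffman_lengths(q)   # table built from the prediction

    avg_p = sum(p[s] * code_p[s] for s in p)   # 1.55 bits/symbol with the right table
    avg_q = sum(p[s] * code_q[s] for s in p)   # 2.00 bits/symbol with the predicted table
    print(avg_q - avg_p)                       # 0.45, roughly KL(p||q) ~= 0.51 bits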

------
jules
The KL divergence is also called relative entropy. Unlike the ordinary
entropy, relative entropy is invariant under parameter transformations. The
maximum relative entropy principle generalises Bayesian inference. The
distribution relative to which you're computing the entropy plays the role of
the prior.

By the way, I find the following way to rewrite the entropy easier to
understand because all quantities are positive:

sum(-p_i log(p_i)) = sum(p_i log(1/p_i)) = E[log(1/p_i)]

log(1/p_i) (with the log taken base 2) tells you how many bits you need to
encode an event with probability p_i. The more unlikely the event, the more
bits you need. The entropy is the expected number of bits you need.
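
In code form (a tiny sketch, assuming NumPy and base-2 logs so the result
comes out in bits):

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])   # event probabilities
    bits = np.log2(1.0 / p)                   # bits per event: 1, 2, 3, 3
    entropy = np.sum(p * bits)                # expected number of bits
    print(entropy)                            # 1.75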

------
derEitel
Great, intuitive explanations with a nice mix of code and formulas. My only
complaint is that the GIFs were very annoying while reading, especially as
they do not add to the content.

------
caiocaiocaio
Lovely article, but grey-on-white and a small, thin display font meant I had
to go into developer tools to be able to read it without getting a headache.

~~~
h2onock
It was nice and clear on my phone.

------
doombolt
I have a hunch that space engineers have suddenly invented Huffman coding.

(Which leads to a general observation of "just throw in transparent
compression instead of optimizing your data format")

EDIT: s/encryption/compression/

~~~
cryptonector
I don't know why you got downvoted with no explanation. My observation is the
same: this is just about measuring the efficiency of one's Huffman tables
given the actual probability distribution after the fact.

