An Elementary Introduction to Information Geometry [pdf]

antixk · on Oct 1, 2020

As a student of Information Geometry, let me provide some context why this is such an exciting field. Usually geometry reminds someone of triangle and other shapes that we see in our immediate surroundings, which is called as Euclidean or just flat geometry i. e a space where Euclid's axioms are valid. But there is a lot more to the story, we can bend/reformulate or even exclude certain axioms to conjure up new spaces - such as hyperbolic geometry or in general some complex curved geometry. Turns out, these indeed have wide applications in our real world.

Now what's all this gotta do with Information? Usually, information is represented in terms of statistical distributions, from Shannon's information theory. What the early founders of IG observed is that, these statistical distributions can be represented as points on some curved space called a Statistical Manifold. Now, all the terms used in information theory can be reinterpreted in terms of geometry.

So, why is it so exciting? Well in Deep Learning people predominantly work statistical distributions, some even without realising it. All our optimizations involve reducing distance between some statistical distributions like the distribution of of the data and the distribution that the neural network is trying to model. Turns out, such optimization when done in the space of statistical manifold, amounts to the gradient descent that we all know and love. All the gradient based optimisations are only approximations to the local geometry like gradient(local slope) , Hessian(local quadratic approximation of curvature), but optimisation in the statistical manifold can yield the exact curvature and thus are more efficient. This method is called Natural Gradient.

Hope this helps.

amirkdv · on Oct 1, 2020

That's indeed mathematically very exciting and reason enough to study IG.

But does IG allow us to reason about Neural Nets in new ways that could move the needle on open questions about information representation in ANNs or, even better, BNNs?

antixk · on Oct 1, 2020

Good question. Most people are focussing on the natural gradient and making it as efficient as SGD. But some have been exploring if we can introduce inductive bias in the function space rather than the weight space, using IG. But it is still quite a new field.

ogogmad · on Oct 1, 2020

BNNs?

jmt_ · on Oct 1, 2020

They're distinguishing between artificial neural networks(ANN) and biological neural networks(BNN)

7373737373 · on Oct 1, 2020

Perhaps one day this could be used to reduce distance between mathematical proofs in "proof space"? :)

madhadron · on Oct 1, 2020

So it's the geometry of the set of probability measures on a pre-measure space? If so, that sounds like something that might be interesting for point processes as well.

krychu · on Oct 1, 2020

But it also amounted to gradient descent without any mention of information geometry.

benrbray · on Oct 1, 2020

As someone who knows a little information theory and enough differential geometry to be dangerous, I have to admit that I've tried numerous to learn about Information Geometry but haven't been able to penetrate the surface.

The introductory articles like this tend to get bogged down in the definitions, and research-level articles are pretty impenetrable. If anyone here has a little more insight into this field, I'd be really interested to know more about the impact IG has had on other areas (be it applications to the real world, or to helping organize our understanding of another field of math).

credit_guy · on Oct 1, 2020

Let me take a stab at this.

A while ago I read about a guy's innovative idea to classify languages using zip (the archiving software). This idea is cute in itself, if you are curious you can check [1]. My rough summary would be like this: when you zip a French text, the zip facility first spends some time to train itself, and finds the letter frequencies and then compresses the text using these frequencies. If you train zip to a French text, but then use the frequencies to compress a Spanish text, you'll do worse than if you use the frequencies of the Spanish language itself. However, if you use the French frequencies to compress Hungarian text, you will do much, much worse. Then you ask the question: if for a certain language A, a zipped text is 1 MB using the language's own frequencies but 1.3MB using the frequencies form a different language B, then you say the distance between A and B is 0.3. If it takes 1.05 MB, then the distance is 0.05. If two languages are very similar like Nrwegian and Swedish, this distance is very small, if they are very different, the distance could be quite large.

This concept of distance is essentially the KL-divergence [2]. It's not symmetric, so the distance between A and B is different compared to the distance between B and A, so it's not exactly what mathematicians call a "distance". But it's very useful nonetheless. When you train a classifier for example, many times you measure the quality of the classifier using the KL-divergence, but in that context people use the term cross-entropy.

You can for example do a cluster analysis and find groups of languages that are related, like the Romance languages, or Slavic languages, and this without knowing anything about these languages at all.

Now, how many languages are out there? Let's say 5000. Imagine you want to create one of those cool graphs where each language is linked with its closes neighbors and the length of that edge is equal to this "distance". You won't be able to do this exactly, because you are constrained by the geometry of the flat plane, but if you allow yourself to go in higher dimensions, you can do that ( * ). This graph will then have some shape, it's going to be somewhat curved.

For example, if you restrict yourself to the Romance languages, and you put Latin in the center and put the current languages (Spanish, Portuguese, Catalan, French, Italian, Romanian, there are probably a few more) around it, you get something that looks like a circle. Is the circumference of this "circle" higher, or lower than 2piradius? If it's lower, then you say the space has positive curvature, if it's higher, negative.

So, information geometry is 1. taking care of the various details I brushed off (like the fact that you actually use the square root of the KL-divergence to define the distance) and 2. studying the geometric properties of the metric space you produce.

( * ) This fact is known as the Nash embedding theorem, the same Nash who got a Nobel for the Nash equilibrium.

[1] https://link.springer.com/chapter/10.1007/978-3-540-31865-1_...

[2]https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver...

scottlocklin · on Oct 1, 2020

The most intriguing use cases I've seen is for when it gets used to create better estimators (using the geometry of the Fisher information matrix). I guess it works particularly well on things like the EM algorithm, but there are probably lots of signal processing-ey applications.

People babbling about neural nets; that's probably wrong -never seen an actual application there, though TDA does very interesting things for neural approaches for reasons that should be obvious.

I kind of wish I had done the GR course with Carlo Rovelli and Ezra Newman back when I had the chance; the first semester was all differential geometry, and those dudes were great at explaining math. Since the Einstein equations are more nasty than the Fisher information matrix, it would have made all the info-geo stuff pretty accessible.

onecommentman · on Oct 1, 2020

Worth a read just for the clearest description of differential geometry I’ve ever read (first ten pages). I don’t think you could learn it from the intro, but you can learn why you might want to know it if you have a “basic mathematical sophistication”.

The only thing I remember about Information Geometry in a practical sense was it was related to EM (Expectation-Maximization) algorithms in some way. Most of the associated information geometry research was done in Japan, so logical that Sony would have an interest.

Information geometry sounds like a “theory of everything” sort of abstraction that either should lead to breakthrough insights, or just be a forensic sort of postgame wrap-up explaining what’s really going on in that most practical yet ugliest of all mathematical disciplines, statistics. What I remember seemed to me to be a lot of “Adventures in Dualities”, so your mind has to be comfortable with that.

Then again, who needs recreational drugs when you have pure math? Wonder if DEA has considered that motto.

pmiller2 · on Oct 1, 2020

> Worth a read just for the clearest description of differential geometry I’ve ever read (first ten pages). I don’t think you could learn it from the intro, but you can learn why you might want to know it if you have a “basic mathematical sophistication”.

I'm going to read at least the first part because of this comment. Thanks.

webmaven · on Oct 1, 2020

> The only thing I remember about Information Geometry in a practical sense was it was related to EM (Expectation-Maximization) algorithms in some way. Most of the associated information geometry research was done in Japan, so logical that Sony would have an interest.

Hmm. Expectation Maximization, as opposed to Regret Minimization?

kgwgk · on Oct 1, 2020

https://en.wikipedia.org/wiki/Expectation–maximization_algor...

  The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

enugu · on Oct 1, 2020

Great post.

Equation 20 defining curvature has a typo in the first term on the right - It should be ∇_X∇_Y Z instead of ∇_X∇_Y X. Might be useful for anyone reading it the first time to avoid confusion.

da39a3ee · on Oct 1, 2020

Without meaning to denigrate pure theory, does anyone know what one might be able to do with this theory?

(I like reading about theory but I'm clearly going to have to put this one off until I've learned some differential geometry, hopefully in this lifetime.)

benrbray · on Oct 1, 2020

In case you're interested, my recommendation for learning differential geometry is Lee's "Smooth Manifolds". I prefer the 1st edition to the 2nd [0], but both present a much more clear overview of the basic definitions than I've seen anywhere else.

I think differential geometry is a field where it really helps to think about the "types" of each object, and to write the type signatures down explicitly for every function of interest. Unfortunately even Lee's book doesn't always do this, but I think it'll help you when taking notes.

If you just want to learn about the "spirit" of DG and not the rigorous details, I highly highly recommend Keenan Crane's "Discrete Differential Geometry" book [1] and course [2,3]. It's very computationally-oriented and you'd be in a good position to read more about the continuous case afterwards.

[0] due to differences in order of topics and typographical style [1] http://www.cs.cmu.edu/~kmcrane/Projects/DGPDEC [2] http://brickisland.net/DDGSpring2020/ [3] https://www.youtube.com/watch?v=FRvhgkGKfSM&list=PL9_jI1bdZm...

da39a3ee · on Oct 1, 2020

Thanks!

> I think differential geometry is a field where it really helps to think about the "types" of each object, and to write the type signatures down explicitly for every function of interest.

Do you have an opinion on https://mitpress.mit.edu/books/functional-differential-geome...?

I haven't gone through any of it, but I have gone through some of their companion book on classical mechanics. https://mitpress.mit.edu/sites/default/files/titles/content/...

benrbray · on Oct 1, 2020

That book has been on my reading list for a while, but I haven't yet done more than skim the first couple chapters. So, disclaimer: The below is my "HOT TAKE" (extremely unsubstantiated opinion) that you should definitely take with a grain of salt!

I really like the idea of providing code+math together! I applaud their attempt at using code to clarify differential geometry notation, which is very overloaded.

However, I am extremely skeptical that using Scheme (or any Lispy language) is the best choice here. I'm guessing the choice of Scheme has little to do with pedagogical value and more to do with the fact that Scheme was created by Sussman, one of the authors.

I think Scheme is a poor choice here, mostly due to its lack of static types / type annotations. I'm also not a fan of S-expressions, but that's a different can of worms. To me, the lack of type information makes Scheme even more difficult to read than appropriately-parenthesized math notation. I think Haskell, Idris, or Julia might be more effective for clear communication of mathematical ideas.

I think this quote sums up my view nicely:

> "Dynamic typing" The belief that you can't explain to a computer why your code works, but you can keep track of it all in your head. (https://chris-martin.org/2015/dynamic-typing)

The book is probably very effective at teaching differential geometry to a target audience that already knows Scheme inside and out. But to everyone else? I'm not so sure.

da39a3ee · on Oct 1, 2020

Yes, that has always been my feeling about SICM (their mechanics version). It was very helpful to me to see the Euler-Lagrange equations explained in the way they did, i.e. using Spivak's notation for differential calculus, as well as lisp. But they even state in the introduction that the reason they are taking that approach is because the traditional Leibniz-ish notation is full of type errors, so as you say there seems to be a huge question about why they didn't tackle it with a language with a rich type system like Haskell or Idris. I know they did lots of clever things with schema some of which would be impossible to realize similarly in non-lisps, but still.

currymj · on Oct 1, 2020

natural gradient descent is big in reinforcement learning -- proximal policy optimization (popular RL algorithm, most famously employed for OpenAI DotA bot) can be thought of as a cheap approximation to it. people also use something called KFAC, another approximation of natural gradient.

you can understand everything in those papers without understanding differential geometry, just basic linear algebra -- the core idea is just preconditioning gradients by the inverse of the Fisher information matrix.

the deeper theoretical stuff, i have no idea about, it seems like it must be fascinating though!

muska3 · on Oct 1, 2020

Almost 50 self-citations... Is this normal?

psychoslave · on Oct 1, 2020

Everything is normal, according to some definition of normal. Did you mean with a normal definition of normal?

ngcc_hk · on Oct 2, 2020

https://news.ycombinator.com/item?id=24657225

Seems better intro.

mellosouls · on Oct 1, 2020

https://arxiv.org/abs/1808.08271

onecommentman · on Oct 1, 2020

Great citation. The title “An Elementary Introduction to Information Geometry” is a little silly. This is just a short introduction to information geometry, I don’t think there is anything more “advanced” floating around out there.

(I don’t know why the culture of pure math persists in this. I throw it in with proof by intimidation. Makes pure math types sound like they’re stuck intellectually/socially at being highly precocious 14 year olds. It’s not the individual, it’s the culture.)

Chentsov, that’s a name I haven’t heard in a long time. There’s a result that’s worth being generally known by the data analysis community. The article looks like it gets there more clearly than others I read a while ago.

jhidding · on Oct 1, 2020

Hear, hear! I think this problem is also beautifully put by the Knuth in Surreal Numbers. Results are boiled down to the smallest possible representation, without giving a historical/pedagogical overview, or even the slightest hint about how the author himself also struggled with the material.

This also boils down to what you understand the word "elementary" to mean. It is definitely not synonymous to "simple", rather as Feynman put it, "only requiring an infinite intelligence to understand".

JackFr · on Oct 1, 2020

I recall realizing as an undergraduate math major, the simpler the title of the book the more impossibly dense the material would be.

If your textbook was "Advanced X for Scientists and Engineers", easy-peasy. But if the textbook was "Introductory X" you were in for it.

francasso · on Oct 1, 2020

I think it depends on the target audience. If it's math grad students that have had a first course of DG and probability and statistics, then it's elementary. I think the statement that it's self contained is practically incorrect, it's hard to believe that someone that doesn't already understand manifolds and connections can get a good enough intro to tackle IG from the overview in those notes.

Addendum: What I mean is that you can get much more abstract and take many more things for granted than it's done here. It's like reading an elementary introduction to the geometry of schemes, even if it's elementary there is a bunch of stuff that has to be assumed as known, even if the explanation of schemes per se is elementary

lambdaphagy · on Oct 1, 2020

I think elementary is supposed to mean "self-contained" or "using the basic methods appropriate to the field", not necessarily "easy".

frob · on Oct 1, 2020

"[T]he founder of modern information geometry, defined information geometry in the preface of his latest textbook [2] as follows: 'Information geometry is a method of exploring the world of information by means of modern geometry.' In short, information geometry geometrically investigates information sciences."

Holy circular definitions, Batman!

eternalban · on Oct 1, 2020

It's not actually circular. Think of the analog example of "exploring time-space of physical reality (physics)" by the means of modern mathematics:

https://en.wikipedia.org/wiki/Lorentz_transformation

So the general idea (haven't read OP yet) would be to map problems in information theory to geometric constructs, and use the machinery of modern geometry to explore that domain.

This is for example using algebraic topology to applications in concurrent processing:

https://www.springer.com/us/book/9783319153971

frob · on Oct 1, 2020

The difference between what you gave and what they did is you used different words in the definition from the description. They literally just use the same words in the name as in the definition. Their statement in your terms would be more like, "Mathematical Physics is a method of exploring the world of physics by means of modern mathematics," which is a useless statement.

By no means am I saying this field is worthless or a scam, I'm just saying those definitions are completely useless and circular.

fithisux · on Oct 12, 2020

A very interesting result is Cencov's theorem.

Just for the heck of it do you know if anyone has investigated IG and General Relativity? Can you share refs?

mjfl · on Oct 1, 2020

seems buzzwordy

whymauri · on Oct 1, 2020

Uh, how?

onecommentman · on Oct 1, 2020

Probably the poster defines a word as a buzzword if you don’t know why the underlying concept the word represents is of importance or relevance. And there is a lot of jargon and abstraction here. Without a “basic mathematical sophistication” = Bachelors in Pure Math, it would sound like abstract babble. Even with a Basic Mathematical Sophistication (BMS, I’m claiming the acronym), it’s pretty far out there. Fair enough.