Now what's all this got to do with information? Information is usually represented in terms of statistical distributions, following Shannon's information theory. What the early founders of IG observed is that these statistical distributions can be represented as points on a curved space called a statistical manifold. All the terms used in information theory can then be reinterpreted in terms of geometry.
So, why is it so exciting? Well, in deep learning people predominantly work with statistical distributions, some without even realising it. Our optimizations involve reducing the distance between statistical distributions, like the distribution of the data and the distribution that the neural network is trying to model. It turns out that such optimization, when done on the statistical manifold, generalizes the gradient descent we all know and love. Ordinary gradient-based optimization only uses local approximations of the geometry, the gradient (local slope) and the Hessian (local quadratic approximation of curvature), whereas optimization on the statistical manifold takes the manifold's intrinsic geometry into account and can be more efficient. This method is called the natural gradient.
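As a minimal sketch of the idea (my own toy example, not from the article): fitting a Bernoulli parameter p, where the Fisher information is just the scalar 1/(p(1-p)), so the natural gradient step is simply the ordinary gradient rescaled by its inverse:

```python
import numpy as np

# Toy example: fit a Bernoulli parameter p to coin-flip data
# by gradient ascent on the mean log-likelihood.
rng = np.random.default_rng(0)
data = (rng.random(1000) < 0.7).astype(float)  # true p = 0.7

def grad_loglik(p):
    # d/dp of the mean log-likelihood of Bernoulli(p)
    return np.mean(data / p - (1.0 - data) / (1.0 - p))

p_plain, p_nat, lr = 0.1, 0.1, 0.1
for _ in range(200):
    p_plain += lr * grad_loglik(p_plain)       # vanilla gradient step
    fisher = 1.0 / (p_nat * (1.0 - p_nat))     # Fisher information of Bernoulli(p)
    p_nat += lr * grad_loglik(p_nat) / fisher  # natural gradient: F^-1 * gradient

# The natural step simplifies to lr * (mean(data) - p): steady progress toward
# the optimum no matter where p sits, while the raw gradient blows up near 0 and 1.
print(p_plain, p_nat)  # both end up near the empirical mean, ~0.7
```

The point isn't that vanilla gradient ascent fails here (it doesn't), but that preconditioning by the Fisher information gives steps that are invariant to how you parameterize the distribution.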
Hope this helps.
But does IG allow us to reason about Neural Nets in new ways that could move the needle on open questions about information representation in ANNs or, even better, BNNs?
Introductory articles like this tend to get bogged down in definitions, and research-level articles are pretty impenetrable. If anyone here has a little more insight into this field, I'd be really interested to know more about the impact IG has had on other areas (be it applications to the real world, or helping organize our understanding of another field of math).
People are babbling about neural nets; that's probably wrong. I've never seen an actual application there, though TDA does very interesting things for neural approaches, for reasons that should be obvious.
I kind of wish I had done the GR course with Carlo Rovelli and Ezra Newman back when I had the chance; the first semester was all differential geometry, and those dudes were great at explaining math. Since the Einstein equations are nastier than the Fisher information matrix, it would have made all the info-geo stuff pretty accessible.
A while ago I read about a guy's innovative idea to classify languages using zip (the archiving software). The idea is cute in itself; if you are curious you can check . My rough summary would be like this: when you zip a French text, the zip facility first spends some time training itself, finding the letter frequencies, and then compresses the text using those frequencies. If you train zip on a French text, but then use the frequencies to compress a Spanish text, you'll do worse than if you use the frequencies of Spanish itself. However, if you use the French frequencies to compress a Hungarian text, you will do much, much worse. Then you ask the question: if for a certain language A, a zipped text is 1 MB using the language's own frequencies but 1.3 MB using the frequencies from a different language B, then you say the distance between A and B is 0.3. If it takes 1.05 MB, then the distance is 0.05. If two languages are very similar, like Norwegian and Swedish, this distance is very small; if they are very different, the distance could be quite large.
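A quick sketch of the same idea with zlib. The language snippets below are made-up toy strings, and instead of literally reusing one language's frequency table I use the standard normalized-compression-distance trick of compressing the concatenation:

```python
import zlib

def csize(s: str) -> int:
    # compressed size in bytes at maximum compression level
    return len(zlib.compress(s.encode("utf-8"), 9))

def distance(a: str, b: str) -> float:
    # Normalized compression distance: extra bytes needed to describe
    # one text once the other has been seen, normalized by the larger size.
    cab = csize(a + b)
    return (cab - min(csize(a), csize(b))) / max(csize(a), csize(b))

# Made-up toy snippets; a real experiment would use long natural texts.
norwegian = "jeg heter Ola og jeg bor i Norge " * 20
swedish   = "jag heter Ola och jag bor i Sverige " * 20
hungarian = "a nevem Ola es Magyarorszagon elek " * 20
print(distance(norwegian, swedish), distance(norwegian, hungarian))
```

On real corpora, clustering these pairwise distances does recover language families, which is the result the comment describes.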
This concept of distance is essentially the KL-divergence. It's not symmetric (the distance from A to B differs from the distance from B to A), so it's not exactly what mathematicians call a "distance". But it's very useful nonetheless. When you train a classifier, for example, you often measure its quality using the KL-divergence, though in that context people use the term cross-entropy.
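For concreteness, a tiny numerical check of the asymmetry (the three-letter "frequency" vectors are made-up numbers):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # made-up letter frequencies for "language A"
q = np.array([0.4, 0.4, 0.2])   # made-up letter frequencies for "language B"

def kl(p, q):
    # KL(p || q): expected extra bits when coding samples from p
    # with a code optimized for q
    return float(np.sum(p * np.log2(p / q)))

def cross_entropy(p, q):
    # H(p, q) = H(p) + KL(p || q): total bits per symbol with the wrong code
    return float(-np.sum(p * np.log2(q)))

print(kl(p, q), kl(q, p))  # close but not equal: KL is asymmetric
```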
You can for example do a cluster analysis and find groups of languages that are related, like the Romance languages, or Slavic languages, and this without knowing anything about these languages at all.
Now, how many languages are out there? Let's say 5000. Imagine you want to create one of those cool graphs where each language is linked with its closest neighbors and the length of each edge is equal to this "distance". You won't be able to do this exactly, because you are constrained by the geometry of the flat plane, but if you allow yourself to go into higher dimensions, you can ( * ). This graph will then have some shape; it's going to be somewhat curved.
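To make the higher-dimensional embedding concrete, here's a sketch with classical multidimensional scaling on a made-up distance matrix (four hypothetical languages; given enough dimensions, a Euclidean distance matrix can be matched exactly):

```python
import numpy as np

# Made-up symmetric "distance" matrix among 4 hypothetical languages:
# 0 and 1 are close (think Norwegian/Swedish), 2 and 3 form another pair.
D = np.array([[0.00, 0.05, 0.90, 1.00],
              [0.05, 0.00, 0.95, 1.00],
              [0.90, 0.95, 0.00, 0.30],
              [1.00, 1.00, 0.30, 0.00]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
evals, evecs = np.linalg.eigh(B)
order = np.argsort(evals)[::-1]       # largest eigenvalues first
# One coordinate axis per non-negative eigenvalue; more dimensions, better fit.
coords = evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))
print(np.round(coords[:, :2], 2))     # a 2-D view of the embedding
```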
For example, if you restrict yourself to the Romance languages, and you put Latin in the center with the current languages (Spanish, Portuguese, Catalan, French, Italian, Romanian, there are probably a few more) around it, you get something that looks like a circle. Is the circumference of this "circle" higher or lower than 2πr? If it's lower, you say the space has positive curvature; if it's higher, negative.
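The sphere is the standard worked example of this: a geodesic circle of radius r on a unit sphere has circumference 2π·sin(r), which is less than the flat-plane value 2πr, and for small r the deficit is about πr³/3 times the curvature:

```python
import numpy as np

r = 0.5                          # geodesic radius
flat = 2 * np.pi * r             # circumference in the flat plane
sphere = 2 * np.pi * np.sin(r)   # circumference of a geodesic circle on a unit sphere
# 2*pi*sin(r) = 2*pi*r - pi*r**3/3 + ..., so the deficit measures curvature (K = 1 here)
print(sphere < flat)  # True: circumference deficit means positive curvature
```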
So, information geometry is 1. taking care of the various details I brushed over (like the fact that you actually use the square root of the KL-divergence to define the distance) and 2. studying the geometric properties of the metric space you produce.
( * ) This fact is known as the Nash embedding theorem, the same Nash who got a Nobel for the Nash equilibrium.
The only thing I remember about information geometry in a practical sense is that it was related to EM (Expectation-Maximization) algorithms in some way. Most of the associated information geometry research was done in Japan, so it's logical that Sony would have an interest.
Information geometry sounds like a “theory of everything” sort of abstraction that either should lead to breakthrough insights, or just be a forensic sort of postgame wrap-up explaining what’s really going on in that most practical yet ugliest of all mathematical disciplines, statistics. What I remember seemed to me to be a lot of “Adventures in Dualities”, so your mind has to be comfortable with that.
Then again, who needs recreational drugs when you have pure math? Wonder if DEA has considered that motto.
I'm going to read at least the first part because of this comment. Thanks.
Hmm. Expectation Maximization, as opposed to Regret Minimization?
The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
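The classic concrete instance is fitting a Gaussian mixture. A minimal 1-D sketch (synthetic data, two components, all the specific numbers are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: two well-separated 1-D Gaussians
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])

mu = np.array([-1.0, 1.0])        # initial guesses for the means
sigma = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

for _ in range(50):
    # E step: responsibility of each component for each point
    # (the 1/sqrt(2*pi) constant cancels in the normalization below)
    dens = weight * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    weight = nk / len(data)

print(np.sort(mu))  # the means converge near the true values -2 and 3
```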
Just for the heck of it do you know if anyone has investigated IG and General Relativity? Can you share refs?
Equation 20 defining curvature has a typo in the first term on the right - It should be ∇_X∇_Y Z instead of ∇_X∇_Y X. Might be useful for anyone reading it the first time to avoid confusion.
(I like reading about theory but I'm clearly going to have to put this one off until I've learned some differential geometry, hopefully in this lifetime.)
I think differential geometry is a field where it really helps to think about the "types" of each object, and to write the type signatures down explicitly for every function of interest. Unfortunately even Lee's book doesn't always do this, but I think it'll help you when taking notes.
If you just want to learn about the "spirit" of DG and not the rigorous details, I highly highly recommend Keenan Crane's "Discrete Differential Geometry" book  and course [2,3]. It's very computationally-oriented and you'd be in a good position to read more about the continuous case afterwards.
 due to differences in order of topics and typographical style
> I think differential geometry is a field where it really helps to think about the "types" of each object, and to write the type signatures down explicitly for every function of interest.
Do you have an opinion on https://mitpress.mit.edu/books/functional-differential-geome...?
I haven't gone through any of it, but I have gone through some of their companion book on classical mechanics. https://mitpress.mit.edu/sites/default/files/titles/content/...
I really like the idea of providing code+math together! I applaud their attempt at using code to clarify differential geometry notation, which is very overloaded.
However, I am extremely skeptical that using Scheme (or any Lispy language) is the best choice here. I'm guessing the choice of Scheme has little to do with pedagogical value and more to do with the fact that Scheme was created by Sussman, one of the authors.
I think Scheme is a poor choice here, mostly due to its lack of static types / type annotations. I'm also not a fan of S-expressions, but that's a different can of worms. To me, the lack of type information makes Scheme even more difficult to read than appropriately-parenthesized math notation. I think Haskell, Idris, or Julia might be more effective for clear communication of mathematical ideas.
I think this quote sums up my view nicely:
> "Dynamic typing" The belief that you can't explain to a computer why your code works, but you can keep track of it all in your head. (https://chris-martin.org/2015/dynamic-typing)
The book is probably very effective at teaching differential geometry to a target audience that already knows Scheme inside and out. But to everyone else? I'm not so sure.
You can understand everything in those papers without understanding differential geometry, just basic linear algebra -- the core idea is preconditioning gradients by the inverse of the Fisher information matrix.
The deeper theoretical stuff I have no idea about; it seems like it must be fascinating, though!
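A sketch of that preconditioning with an actual matrix: fitting the mean and standard deviation of a Gaussian, where the Fisher information matrix happens to be diagonal (toy data of my own, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(2.0, 1.5, 2000)   # fit mu and sigma of a Gaussian

mu, sigma, lr = 0.0, 1.0, 0.1
for _ in range(200):
    # Gradient of the mean log-likelihood with respect to (mu, sigma)
    g = np.array([np.mean(data - mu) / sigma**2,
                  np.mean((data - mu) ** 2) / sigma**3 - 1.0 / sigma])
    # Fisher information matrix of N(mu, sigma^2) in (mu, sigma) coordinates
    F = np.array([[1.0 / sigma**2, 0.0],
                  [0.0, 2.0 / sigma**2]])
    # Natural gradient step: precondition the gradient by F^-1
    mu, sigma = np.array([mu, sigma]) + lr * np.linalg.solve(F, g)

print(mu, sigma)  # close to the sample mean and standard deviation
```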
Seems like a better intro.
(I don’t know why the culture of pure math persists in this. I throw it in with proof by intimidation. Makes pure math types sound like they’re stuck intellectually/socially at being highly precocious 14 year olds. It’s not the individual, it’s the culture.)
Chentsov, that’s a name I haven’t heard in a long time. There’s a result that’s worth being generally known by the data analysis community. The article looks like it gets there more clearly than others I read a while ago.
This also boils down to what you understand the word "elementary" to mean. It is definitely not synonymous with "simple"; rather, as Feynman put it, "only requiring an infinite intelligence to understand".
If your textbook was "Advanced X for Scientists and Engineers", easy-peasy. But if the textbook was "Introductory X" you were in for it.
Addendum: What I mean is that you can get much more abstract and take many more things for granted than is done here. It's like reading an elementary introduction to the geometry of schemes: even if it's elementary, there's a bunch of stuff that has to be assumed as known, even if the explanation of schemes per se is elementary.
Holy circular definitions, Batman!
So the general idea (haven't read OP yet) would be to map problems in information theory to geometric constructs, and use the machinery of modern geometry to explore that domain.
An example of this in another setting is applying algebraic topology to concurrent processing:
By no means am I saying this field is worthless or a scam, I'm just saying those definitions are completely useless and circular.