Geometric Understanding of Deep Learning (2018) (arxiv.org)
164 points by yoquan on Jan 24, 2019 | hide | past | favorite | 32 comments


Wow. As far as I know, this is the first time anyone reputable[a] has claimed to show (!) that the "manifold hypothesis" is the fundamental principle that makes deep learning work, as has long been believed:

  "In this work, we give a geometric view to
   understand deep learning: we show that the
   fundamental principle attributing to the
   success is the manifold structure in data,
   namely natural high dimensional data
   concentrates close to a low-dimensional
   manifold, deep learning learns the manifold
   and the probability distribution on it."
Moreover, the authors also claim to have come up with a way of measuring how hard it is for any deep neural net (of fixed size) to learn a parametric representation of a particular lower-dimensional manifold embedded in some higher-dimensional space:

  "We further introduce the concepts of rectified
   linear complexity for deep neural network
   measuring its learning capability, rectified
   linear complexity of an embedding manifold
   describing the difficulty to be learned. Then
   we show for any deep neural network with fixed
   architecture, there exists a manifold that
   cannot be learned by the network."
Finally, the authors also propose a novel way to control the probability distribution in the latent space. I'm curious to see how their method compares and relates to recent work, e.g., with discrete and continuous normalizing flows:

  "...we propose to apply optimal mass
   transportation theory to control the
   probability distribution in the latent space."
This is not going to be a light read...

--

[a] One of the authors, Shing-Tung Yau, is a Fields medalist: https://news.ycombinator.com/item?id=18987219


I've been out of academia for a long time, but isn't the notion that 'natural' high dimensional data lies on low dimensional manifolds the basic premise of all machine learning techniques?


Yes, of course, this has long been a widely held assumption. That's why we call it the "manifold hypothesis."

As far as I know, no one has been able to show -- with mathematical rigor -- that this is why deep learning works so well on so many challenging perceptual ("cognitive") tasks.

That seems significant to me.


But have they actually shown any such thing? There are some ambitious words, then some very elementary definitions, and some "theorems"... and then it ends?


Is this dual in any way to the information bottleneck hypothesis?


Yes, the authors mention it in the introduction as a "well accepted manifold assumption."


Can you ELI5 what 'control the probability distribution in the latent space' means?


It means being able to control the statistical properties of the embeddings (e.g., vector representations of objects) learned by the layers of a deep neural network.
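As a toy illustration of what "controlling" a latent distribution can mean (this is only a sketch, not the paper's optimal-transport construction): in one dimension the optimal transport map to a target distribution reduces to quantile matching, so you can push skewed latent codes onto, say, a uniform distribution while preserving their ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D "latent codes" from some encoder: skewed, non-uniform.
latent = rng.exponential(scale=2.0, size=1000)

# In 1-D, the optimal transport map to a target distribution is just
# quantile matching: sort the codes and replace each with the
# corresponding quantile of the target (here, uniform on [0, 1]).
order = np.argsort(latent)
target_quantiles = (np.arange(len(latent)) + 0.5) / len(latent)
transported = np.empty_like(latent)
transported[order] = target_quantiles

# The transported codes follow the target distribution while preserving
# the ordering (and hence much of the structure) of the originals.
print(transported.min() >= 0.0 and transported.max() <= 1.0)  # True
print(np.all(np.argsort(transported) == order))               # True
```

The actual paper works in higher dimensions, where the transport map is far less trivial; this just shows the kind of "reshaping" being talked about.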


[An OT question as somebody not familiar with the academic world] Two of the authors are in a Chinese university. Two of them in different departments in a US university. In general, how does this kind of intercontinental collaboration start, and how do they progress? How are roles defined when multiple people are involved in a theoretical paper like this? Are there some tools that help with collaborative paper writing?


Lots of ways. Typically people meet through conferences, where high-impact work is published. It's also not uncommon for researchers to cold-email each other based on mutual interests or friend-of-a-friend connections. Tools like Google Scholar can send you email alerts when your papers are cited, which can spur discussion and collaboration as well.

Honestly, you could email an author on this paper and they would probably tell you directly.


Yau is the connection. He's a professor at Harvard, but he's also been a major force in the development of Chinese mathematical academic institutions. He's the director of several of them.


And Gu got his PhD in CS at Harvard in computational geometry. Yau teaches differential geometry there.


Possibly less sophisticatedly, I think of them as a sandwich of affine maps and nonlinear isotropies (like those giving irregular rings in tree trunks). The affinities are represented nicely in GL(n+1) with a homogeneous-coordinates trick related to neuron biases. A question would be whether there's something interesting to say about the interactions of the affinities and isotropies in group-theoretic terms (which I dunno).


ReLU is very simple in this regard. In its plain form it's just an affine transformation followed by 'a viewport'. The mapping through multiple layers alternates affine transformations with windows onto the data. Learning is a combination of squeezing and rotating the data so that it can be seen or unseen through the window, and rotating the window frames to do the same.

Any extra stuff, like batch normalization between the layers, can introduce more complex nonlinearity.
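A minimal numpy sketch of that "affine map, then viewport" reading (nothing here is from the paper; the shapes and weights are made up for illustration):

```python
import numpy as np

# A plain ReLU layer, written to emphasize the reading above: the affine
# part can rotate, squeeze, and shift the data, and ReLU then clips away
# everything outside the positive orthant -- the "window".

def relu_layer(x, W, b):
    affine = x @ W + b               # rotate / squeeze / shift the data
    return np.maximum(affine, 0.0)   # the "viewport": clip to the positive orthant

# Two toy layers chained together: affine, window, affine, window.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

h = relu_layer(x, W1, b1)
y = relu_layer(h, W2, b2)
print(y.shape)         # (4, 2)
print((y >= 0).all())  # True: nothing survives outside the window
```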


Really agree about the extra simplicity in the ReLU case. Liwen Zhang, Gregory Naitzat and Lek-Heng Lim showed last year that "the family of such neural networks is equivalent to the family of tropical rational maps", where rational functions are quotients of polynomials and "tropical" is in the sense of tropical geometry: instead of the usual "plus, times" ring one uses a "min, plus" semiring, which has somewhat unexpected applications. For instance, with its module theory one can compute minimum-cost paths in graphs just as one computes reachability starting from the adjacency matrix of a graph over a boolean semiring. arXiv:1805.07091v1
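To make the min-plus remark concrete (a standard illustration, not from the paper above): replace "times" with "plus" and "plus" with "min" in matrix multiplication, and repeated "powers" of the weighted adjacency matrix yield all-pairs shortest path costs, exactly as boolean matrix powers yield reachability.

```python
import numpy as np

INF = np.inf

# Weighted adjacency matrix of a small directed graph; INF means
# "no edge", and the diagonal is 0 (staying put costs nothing).
A = np.array([[0,   1,   4,   INF],
              [INF, 0,   2,   6],
              [INF, INF, 0,   3],
              [INF, INF, INF, 0]])

def min_plus(X, Y):
    # "Matrix multiplication" over the (min, +) semiring:
    # (X (x) Y)[i, j] = min_k X[i, k] + Y[k, j]
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

# Iterating n-1 times gives all-pairs shortest path costs.
D = A
for _ in range(len(A) - 1):
    D = min_plus(D, A)

print(D[0, 3])  # 6.0: the path 0 -> 1 -> 2 -> 3 costs 1 + 2 + 3 = 6
```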


I'm reading this and just realized one author is the famous Fields medalist, Shing-Tung Yau :-)


Sorta like when David Mumford got interested in computer vision, I guess.


You kiddin'?


Nope: http://www.dam.brown.edu/people/mumford/vision/introvision.h...

Tim Gowers is interested in automated theorem proving too!


Surprising. Voevodsky also.


I meant he was also positioned as not being anti proof assistants.


> ...we show that the fundamental principle attributing to the success is the manifold structure in data...

> Then we show for any deep neural network with fixed architecture, there exists a manifold that cannot be learned by the network.

I'd venture a guess that you can extend this result to show that, for any deep neural network with fixed architecture, there exists an adversarial manifold it must be vulnerable to.

In other words not only is there a manifold the neural network cannot learn, but there is also a manifold it will learn, but incorrectly.


Anybody know what I should study in order to understand this research paper?


Looks like you need to know a little bit about manifolds, measure theory/probability theory, and topology, and more importantly have the requisite "mathematical maturity".

For a very easy intro to manifolds and measure you could take a look at [0] A Visual Introduction to Differential Forms and Calculus on Manifolds by Fortney and [1] The Lebesgue Integral for Undergraduates by Johnston.

[0] https://www.amazon.com/Visual-Introduction-Differential-Calc...

[1] https://www.amazon.com/Lebesgue-Integral-Undergraduates-MAA-...


Yeah, despite the purported firepower of the authors, this is not a dense paper.


Maybe that shows most clearly the firepower of the authors...


I'd reject it. I'm honestly struggling to find anything at all to chew on here...


Another interesting paper on optimal transportation and GAN: https://arxiv.org/abs/1710.05488


I know next to nothing about deep learning. But this geometric interpretation really reminds me of the way self-organising maps work. Is there a real connection there, or is that superficial?


I may be wrong, having just heard of self-organizing maps...

But they do seem related, in that it's arguing the data can always be (mostly) accurately represented by a map projecting onto a lower-dimensional manifold. The recent paper on topological covers, along with the self-organizing maps wiki, seems to indicate that nothing should be harmed by doing this in a discrete setting.

Essentially, a self-organizing map projects to the R^n slices and then learns the shape of the manifold generated by that atlas of projections.
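For anyone else unfamiliar with them, here's a minimal 1-D self-organizing map in numpy (a toy sketch; the data, sizes, and schedules are all made up): a chain of units is dragged toward the data, each winner pulling its chain-neighbors along, so the chain ends up tracing the low-dimensional manifold the data lies near.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: points near a 1-D curve (a noisy arc) embedded in R^2.
t = rng.uniform(0, np.pi, size=500)
data = np.stack([np.cos(t), np.sin(t)], axis=1) \
       + rng.normal(scale=0.05, size=(500, 2))

# A chain of units whose weights live in the data space.
n_units = 20
weights = rng.normal(scale=0.1, size=(n_units, 2))
positions = np.arange(n_units)

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)                # decaying learning rate
    sigma = max(3.0 * (1 - epoch / 50), 0.5)   # shrinking neighborhood
    for x in data[rng.permutation(len(data))]:
        # Best matching unit: nearest weight vector to this sample.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighborhood along the chain, centered on the winner.
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)

# After training, the units should sit close to the arc (radius ~1),
# i.e. the chain has learned the shape of the 1-D manifold.
radii = np.linalg.norm(weights, axis=1)
print(radii.mean())
```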


(uploaded to the arXiv in May 2018)


Thanks! We've updated the headline.




