Geometric Understanding of Deep Learning (2018) (arxiv.org)
164 points by yoquan on Jan 24, 2019 | hide | past | favorite | 32 comments


Wow. As far as I know, this is the first time anyone reputable[a] has claimed to show (!) that the "manifold hypothesis" is the fundamental principle that makes deep learning work, as has long been believed:

  "In this work, we give a geometric view to
   understand deep learning: we show that the
   fundamental principle attributing to the
   success is the manifold structure in data,
   namely natural high dimensional data
   concentrates close to a low-dimensional
   manifold, deep learning learns the manifold
   and the probability distribution on it."
Moreover, the authors also claim to have come up with a way of measuring how hard it is for any deep neural net (of fixed size) to learn a parametric representation of a particular lower-dimensional manifold embedded in some higher-dimensional space:

  "We further introduce the concepts of rectified
   linear complexity for deep neural network
   measuring its learning capability, rectified
   linear complexity of an embedding manifold
   describing the difficulty to be learned. Then
   we show for any deep neural network with fixed
   architecture, there exists a manifold that
   cannot be learned by the network."
Finally, the authors also propose a novel way to control the probability distribution in the latent space. I'm curious to see how their method compares and relates to recent work, e.g., with discrete and continuous normalizing flows:

  "...we propose to apply optimal mass
   transportation theory to control the
   probability distribution in the latent space."
This is not going to be a light read...

--

[a] One of the authors, Shing-Tung Yau, is a Fields medalist: https://news.ycombinator.com/item?id=18987219


I've been out of academia for a long time, but isn't the notion that 'natural' high dimensional data lies on low dimensional manifolds the basic premise of all machine learning techniques?


Yes, of course, this has long been a widely held assumption. That's why we call it the "manifold hypothesis."

As far as I know, no one has been able to show -- with mathematical rigor -- that this is why deep learning works so well on so many challenging perceptual ("cognitive") tasks.

That seems significant to me.


But have they actually shown any such thing? There are some ambitious words, then some very elementary definitions, and some "theorems"... and then it ends?


Is this dual in any way to the information bottleneck hypothesis?


Yes, the authors mention it in the introduction as a "well accepted manifold assumption."


Can you ELI5 what 'control the probability distribution in the latent space' means?


It means being able to control the statistical properties of the embeddings (e.g., vector representations of objects) learned by the layers of a deep neural network.
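As a toy illustration of what "controlling" a latent distribution can mean (this is only a sketch, not the paper's optimal-transport construction): in one dimension the optimal transport map to a target distribution reduces to quantile matching, so you can push skewed latent codes onto, say, a uniform distribution while preserving their ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D "latent codes" from some encoder: skewed, non-uniform.
latent = rng.exponential(scale=2.0, size=1000)

# In 1-D, the optimal transport map to a target distribution is just
# quantile matching: sort the codes and replace each with the
# corresponding quantile of the target (here, uniform on [0, 1]).
order = np.argsort(latent)
target_quantiles = (np.arange(len(latent)) + 0.5) / len(latent)
transported = np.empty_like(latent)
transported[order] = target_quantiles

# The transported codes follow the target distribution while preserving
# the ordering (and hence much of the structure) of the originals.
print(transported.min() >= 0.0 and transported.max() <= 1.0)  # True
print(np.all(np.argsort(transported) == order))               # True
```

The actual paper works in higher dimensions, where the transport map is far less trivial; this just shows the kind of "reshaping" being talked about.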


[An OT question as somebody not familiar with the academic world] Two of the authors are in a Chinese university. Two of them in different departments in a US university. In general, how does this kind of intercontinental collaboration start, and how do they progress? How are roles defined when multiple people are involved in a theoretical paper like this? Are there some tools that help with collaborative paper writing?


Lots of ways. Typically people meet through conferences, where high-impact work is published. It's also not uncommon for researchers to cold-email each other based on mutual interests or friend-of-a-friend connections. Tools like Google Scholar can send you email alerts when your papers are cited, which can spur discussion and collaboration as well.

Honestly, you could email an author on this paper and they would probably tell you directly.


Yau is the connection. He's a professor at Harvard, but he's also been a major force in the development of Chinese mathematical academic institutions. He's the director of several of them.


And Gu got his PhD in CS at Harvard in computational geometry. Yau teaches differential geometry there.


Possibly less sophisticatedly, I think of them as a sandwich of affine maps and nonlinear isotropies (like those giving irregular rings in tree trunks). The affinities are represented nicely in GL(n+1) with a homogeneous-coordinates trick related to neuron biases. A question would be whether there's something interesting to say about the interactions of the affinities and isotropies in group-theoretic terms (which I dunno).


ReLU is very simple in this regard. In its plain form it's just an affine transformation followed by 'a viewport'. The mapping through multiple layers alternates affine transformations with windows onto the data. Learning is a combination of squeezing and rotating the data so that it can be seen or unseen through the window, and rotating the window frames to do the same.

Any extra stuff, like batch normalization between the layers, can introduce more complex nonlinearity.
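A minimal numpy sketch of that "affine map, then viewport" reading (nothing here is from the paper; the shapes and weights are made up for illustration):

```python
import numpy as np

# A plain ReLU layer, written to emphasize the reading above: the affine
# part can rotate, squeeze, and shift the data, and ReLU then clips away
# everything outside the positive orthant -- the "window".

def relu_layer(x, W, b):
    affine = x @ W + b               # rotate / squeeze / shift the data
    return np.maximum(affine, 0.0)   # the "viewport": clip to the positive orthant

# Two toy layers chained together: affine, window, affine, window.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

h = relu_layer(x, W1, b1)
y = relu_layer(h, W2, b2)
print(y.shape)         # (4, 2)
print((y >= 0).all())  # True: nothing survives outside the window
```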


Really agree about the extra simplicity in the ReLU case. Liwen Zhang, Gregory Naitzat and Lek-Heng Lim showed last year that "the family of such neural networks is equivalent to the family of tropical rational maps", where rational functions are quotients of polynomials and "tropical" is in the sense of tropical geometry: instead of the usual "plus, times" ring one uses a "min, plus" semiring, which has somewhat unexpected applications. For instance, with its module theory one can compute minimum-cost paths in graphs just as one computes reachability starting from the adjacency matrix of a graph over a boolean semiring. arXiv:1805.07091v1
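To make the min-plus remark concrete (a standard illustration, not from the paper above): replace "times" with "plus" and "plus" with "min" in matrix multiplication, and repeated "powers" of the weighted adjacency matrix yield all-pairs shortest path costs, exactly as boolean matrix powers yield reachability.

```python
import numpy as np

INF = np.inf

# Weighted adjacency matrix of a small directed graph; INF means
# "no edge", and the diagonal is 0 (staying put costs nothing).
A = np.array([[0,   1,   4,   INF],
              [INF, 0,   2,   6],
              [INF, INF, 0,   3],
              [INF, INF, INF, 0]])

def min_plus(X, Y):
    # "Matrix multiplication" over the (min, +) semiring:
    # (X (x) Y)[i, j] = min_k X[i, k] + Y[k, j]
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

# Iterating n-1 times gives all-pairs shortest path costs.
D = A
for _ in range(len(A) - 1):
    D = min_plus(D, A)

print(D[0, 3])  # 6.0: the path 0 -> 1 -> 2 -> 3 costs 1 + 2 + 3 = 6
```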


I'm reading this and just realized one author is the famous Fields medalist, Shing-Tung Yau :-)


Sorta like when David Mumford got interested in computer vision, I guess.


You kiddin'?


Nope: http://www.dam.brown.edu/people/mumford/vision/introvision.h...

Tim Gowers is interested in automated theorem proving too!


Surprising. Voevodsky also.


I meant he was also positioned as not being anti proof assistants.


> ...we show that the fundamental principle attributing to the success is the manifold structure in data...

> Then we show for any deep neural network with fixed architecture, there exists a manifold that cannot be learned by the network.

I'd venture a guess that you can extend this result to show that, for any deep neural network with fixed architecture, there exists an adversarial manifold it must be vulnerable to.

In other words not only is there a manifold the neural network cannot learn, but there is also a manifold it will learn, but incorrectly.


Anybody know what I should study in order to understand this research paper?


Looks like you need to know a little bit about manifolds, measure theory/probability theory, and topology, and more importantly have the requisite "mathematical maturity".

For a very easy intro to manifolds and measure you could take a look at [0] A Visual Introduction to Differential Forms and Calculus on Manifolds by Fortney and [1] The Lebesgue Integral for Undergraduates by Johnston.

[0] https://www.amazon.com/Visual-Introduction-Differential-Calc...

[1] https://www.amazon.com/Lebesgue-Integral-Undergraduates-MAA-...


Yeah, despite the purported firepower of the authors, this is not a dense paper.


Maybe that shows most clearly the firepower of the authors...


I'd reject it. I'm honestly struggling to find anything at all to chew on here...


Another interesting paper on optimal transportation and GAN: https://arxiv.org/abs/1710.05488


I know next to nothing about deep learning. But this geometric interpretation really reminds me of the way self-organising maps work. Is there a real connection there, or is that superficial?


I may be wrong, having just heard of self-organizing maps...

But they do seem related, in that it's arguing the data can always be (mostly) accurately represented by a map projecting onto a lower-dimensional manifold. The recent paper on topological covers, along with the self-organizing maps wiki, seems to indicate that nothing should be harmed by doing this in a discrete setting.

Essentially, a self-organizing map projects to the R^n slices and then learns the shape of the manifold generated by that atlas of projections.
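For anyone else unfamiliar with them, here's a minimal 1-D self-organizing map in numpy (a toy sketch; the data, sizes, and schedules are all made up): a chain of units is dragged toward the data, each winner pulling its chain-neighbors along, so the chain ends up tracing the low-dimensional manifold the data lies near.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: points near a 1-D curve (a noisy arc) embedded in R^2.
t = rng.uniform(0, np.pi, size=500)
data = np.stack([np.cos(t), np.sin(t)], axis=1) \
       + rng.normal(scale=0.05, size=(500, 2))

# A chain of units whose weights live in the data space.
n_units = 20
weights = rng.normal(scale=0.1, size=(n_units, 2))
positions = np.arange(n_units)

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)                # decaying learning rate
    sigma = max(3.0 * (1 - epoch / 50), 0.5)   # shrinking neighborhood
    for x in data[rng.permutation(len(data))]:
        # Best matching unit: nearest weight vector to this sample.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighborhood along the chain, centered on the winner.
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)

# After training, the units should sit close to the arc (radius ~1),
# i.e. the chain has learned the shape of the 1-D manifold.
radii = np.linalg.norm(weights, axis=1)
print(radii.mean())
```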


(uploaded to the arXiv in May 2018)


Thanks! We've updated the headline.




