Neural Networks, Manifolds, and Topology (2014) (colah.github.io)
129 points by flancian 10 days ago | 25 comments


Previous submissions: https://news.ycombinator.com/item?id=7557964 and https://news.ycombinator.com/item?id=9814114

But not a lot of discussion over there.

The visualizations are great, and this basically blew my mind. I didn’t know of the manifold hypothesis until now.

  The manifold hypothesis is that natural data forms lower-dimensional
  manifolds in its embedding space. There are both theoretical and
  experimental reasons to believe this to be true. If you believe this, then
  the task of a classification algorithm is fundamentally to separate a bunch
  of tangled manifolds.
My interpretation/rephrasing: if you want to build a neural network that distinguishes cat and dog pictures, the worst case would seem to require a huge network, with the number of nodes/layers growing as a function of the size of the image, rather than the small number that works reasonably well in practice (say, six layers or some other low constant observed in reality). So the number of dimensions over which the "images" are potentially spread is huge, but it seems that in the real world the dog and cat images can be rearranged into a "shape" that allows relatively easy disentanglement by the neural network; and these shapes can probably be realized in much lower dimensions (six, in the example).

This could explain (for some definition of explain) the observed predictive power of relatively small neural networks.
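As a toy illustration of that idea (a minimal sketch of my own, not from the article): sample points on a one-dimensional curve, embed them in a 100-dimensional ambient space, and check that almost all of the variance is captured by a handful of directions. All the numbers here (curve shape, dimensions, noise level) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 1000 points on a one-dimensional curve (a helix-like path):
# the intrinsic dimension is 1, however high-dimensional the embedding.
t = rng.uniform(0, 4 * np.pi, size=1000)
curve = np.stack([np.cos(t), np.sin(t), t / (4 * np.pi)], axis=1)  # (1000, 3)

# Embed into a 100-dimensional ambient space via a random linear map,
# plus a little noise.
embed = rng.normal(size=(3, 100))
data = curve @ embed + 0.01 * rng.normal(size=(1000, 100))

# Singular values of the centered data: nearly all variance lies in the
# first 3 directions, despite the 100-dimensional ambient space.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
var = s**2 / (s**2).sum()
print(var[:5])  # the first few components dominate
```

The data only "fills" a tiny, low-dimensional sliver of the ambient space, which is exactly the situation the manifold hypothesis describes.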

It seems profound, but it's not really saying anything different than "compression is the same thing as forecasting."

FWIW, unsupervised learning and techniques like topological data analysis are almost entirely about discovering the actual manifolds (or some hand-wavy topology). It doesn't always work; the data often doesn't cooperate and live on a metric space.

That's a great way of saying it concisely. And to continue that generalization - one of the consequences of information theory is that there exist trivially incompressible strings. This plays nicely with the recent (2018) Lei-Luo-Yau-Gu result that there exist manifolds which cannot be learned.

As I harp on at every chance I get, I have a pet hypothesis that there's a very deep corollary here waiting to be proven rigorously. Namely, that we can show there exist adversarial inputs that exploit neural networks because they're incompressible. Furthermore, that these inputs are information-theoretically guaranteed to exploit the neural network (even if there are practical complexity-theoretic workarounds).

>Furthermore, that these inputs are information theoretically guaranteed to exploit the neural network (even if there are practical complexity theoretic workarounds).

I get the image of a technique that, when applied to humans, allows you to see through political speeches and reveals the eldritch horrors scurrying around us continually.

That's if you go about it from low-dimensional embedding to feature classification, and predictions based on probabilities for those features. As you mention, that works if you have metric data, or can juice metrics from abstraction. Modern ML is all about solving for this case; adding the manifold hypothesis, as the author noted, may only improve things slightly, or not at all.

That isn't what the "manifold hypothesis" is good for. Manifold space is more interesting to think of, not in the deconstruction of metric space, but in its formation; i.e., what higher-dimensional, non-metric (intensive) manifold space, through some complex function (e.g. physics), results in this lower-dimensional, feature-rich entity; and what is that function (or, more interestingly, how does it change the resulting entity-state, given various changes in the manifolds)?

Which may not be useful for classification, but that's old hat anyway. Personally, I feel that feature dependence for everything NN is brute force. How features come to be, even abstractly modeled, is the thing.

"compression is the same thing as forecasting"

This is true by definition regardless of manifold hypothesis.

In order to define compression you don't need a nontrivial metric or a topological space. You need those things to even talk about the manifold hypothesis, at least in any interesting way.

Well I'll say it a different way: the manifold hypothesis is basically a geometric restatement of the fact that compression is the same thing as forecasting.

If you couldn't map a metric space (or other distance measure) onto a lower dimensional manifold, you're basically saying you can't forecast using a geometric representation, or you have already extracted all the signal from the noise in some other way.

There are interesting problems where you can't do anything with metrics or quasi-metric distance measures, but you can forecast things. Geometry is easier to reason about than Kolmogorov complexity though, hence papers like the above.

Not quite buying that yet. Why would one need a metric?

Let me give a concrete example: in compression and statistical estimation you can get by fine with KL divergence alone. KL divergence is not a metric and does not define a topology.

One can of course define metrics on the probability space or the space of parameters, but I don't see why that would be necessary.
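A quick numerical check of the point that KL divergence is not a metric (a minimal sketch; the two distributions are made-up examples): KL fails symmetry, one of the metric axioms.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two arbitrary distributions over a 2-element alphabet.
p = [0.9, 0.1]
q = [0.5, 0.5]

# D(p||q) and D(q||p) differ, so KL violates the symmetry axiom
# required of a metric (it also fails the triangle inequality).
print(kl(p, q), kl(q, p))
```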

Hey man, I didn't post the paper on neural nets, manifolds and topology; I'm not going to defend it.

I already said you don't need a metric. But in the presence of an interesting metric or other distance measure, of course the signal fits on a lower-dimensional manifold. Metric spaces and other distance measures give people a lot of tools to reason about data in general; hence virtually all common unsupervised learning algorithms, and stuff like topological data analysis.

Oh, the post is a good one. What I was drawing attention to is that the manifold hypothesis and the compression hypothesis are different concepts. The compression hypothesis is more general, to the point of being a definition: it holds regardless of whether the low-dimensional manifold property holds.

To play devil's advocate... given that manifolds are so general, wouldn't it be more surprising if natural data didn't form some lower dimensional manifold? That is, a manifold only requires some consistent notion of neighborhoods and a way to locally parameterize objects, which is a fairly low bar to pass, and perhaps trivial if you knock off one pixel in the case of images. Maybe the surprising point (which is not said explicitly) is that the manifold is very low dimensional compared to sensor data, but I suppose it's hard to formulate a hypothesis about the magnitude of the reduction besides it being "lower".

> To play devil's advocate... given that manifolds are so general, wouldn't it be more surprising if natural data didn't form some lower dimensional manifold?

No, because not all data can actually be represented on a topology. It's nice if you can model your data as a normed vector space, but it's not inherently possible in general.

In a trivial sense, it seems that a Euclidean topology would do the trick, no? I'd be curious to hear counter-examples, of course. In my mind, I suppose what's missing from the "manifold hypothesis", as stated above, is that the manifolds should be more useful than raw data, for example, does their structure respect our notion of object categories or are they sufficiently low-dimensional for visualization?

The key is the low-dimensional part, not that it is a manifold.

The ambient space may be huge, but the data is not spread all over it; it lies on a tiny subset that can be well described by very few parameters. This limited number of degrees of freedom in the data is what makes it easy to learn. A priori, there is no reason why data should have such a property.

I would be interested in your thoughts on how you would map human language understanding to a Euclidean topology.

A counterexample would be a dataset that forms a fractal in the ambient space. I don't know of a realistic example of this, but it seems plausible if you think of scale-invariant phenomena. Another way of getting fractal-like structures, or at least "dust-like" structures with weird topologies, is by taking intersections of smooth structures. These things would be hard to separate...
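As a rough sketch of what such a dataset can look like (the middle-thirds Cantor set is my illustrative choice here, not from the comment): sample points on the Cantor set and estimate its box-counting dimension, which comes out near log 2 / log 3 ≈ 0.63 rather than any integer — it's "dust", not a manifold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Points of the middle-thirds Cantor set are exactly the numbers in [0, 1]
# whose base-3 digits are all 0 or 2, so we can sample them directly.
digits = rng.choice([0, 2], size=(20000, 20))
powers = 3.0 ** -np.arange(1, 21)
x = digits @ powers

def boxes(pts, eps):
    """Count the boxes of side eps that the point set occupies."""
    return len(np.unique(np.floor(pts / eps).astype(np.int64)))

# Box-counting dimension estimate from two scales:
# dim ~ log(N(eps2) / N(eps1)) / log(eps1 / eps2).
e1, e2 = 3.0 ** -3, 3.0 ** -6
dim = np.log(boxes(x, e2) / boxes(x, e1)) / np.log(e1 / e2)
print(dim)  # close to log(2)/log(3) ~ 0.63, a non-integer dimension
```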

If some data does form a manifold, then you just need to learn one continuous function per class. If it doesn't, then you need to learn several functions for different continuous regions per class; essentially, you need many more parameters to fit data that is fragmented. "Separating tangled manifolds" is really learning the functions for each manifold. (I may be off base here; the material is not fresh in my mind.)

Well, the task of distinguishing cats and dogs is vastly different from the task of learning what defines a dog and a cat. You don't need to learn the underlying data manifold, just a decision function between those two data manifolds. We can't really build scalable models for the second problem yet, but we have made great progress on the first one.

How do people even submit URLs more than once on Hacker News?

Gunnar Carlsson will be teaching a related tutorial ("Using topological data analysis to understand, build, and improve neural networks") on April 16th in New York City https://conferences.oreilly.com/artificial-intelligence/ai-n...

This was the article that helped me get neural networks when I began studying them a few years back. Interpreting them as a series of curvilinear coordinate transformations really helped me understand them better.

PS: There is a great introductory article on entropy on the blog that is worth checking out.

This is beautiful, and surprisingly approachable. Also, feels relevant to this recent conversation: https://news.ycombinator.com/item?id=18987211

If anyone could point me to literature on k-nn neural networks (or the relationship, if any, between k-nn algorithms and basis function decomposition and/or blind source separation) I’d be much obliged.

He mentioned that he applied k-NN to handwritten digit recognition. A likely approach would be as follows.

Given a new vector that needs to be classified (say, x), it is compared with its nearest neighbors in the data set (let's call them x_1, x_2, x_3,..,x_k). A weighted average of the categories of the nearest neighbors is calculated to classify x.

That is to say, the category y that x belongs to is given by some function of categories of the nearest neighbor (e.g. weighted sum)

y = f(w_1*y_1 + ... + w_k*y_k)

where f() is a function that converts the continuous weighted sum to one of the integers representing the categories, and y_1,...,y_k are categories that x_1,...,x_k belong to, respectively.

The weights w_1, ..., w_k can be determined by optimizing some error function.
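The scheme above can be sketched in a few lines (a minimal sketch: inverse-distance weighting is an assumed choice, since the description leaves the weights open, and the "f()" here is just a weighted majority vote rather than a learned function):

```python
import numpy as np

def knn_predict(x, X, y, k=5):
    """Weighted k-NN: classify x by a weighted vote among its k nearest
    training points, weighting each neighbor by inverse distance."""
    d = np.linalg.norm(X - x, axis=1)       # distances to all training points
    idx = np.argsort(d)[:k]                 # indices of x_1, ..., x_k
    w = 1.0 / (d[idx] + 1e-12)              # weights w_1, ..., w_k (assumed scheme)
    votes = {}
    for label, weight in zip(y[idx], w):
        votes[label] = votes.get(label, 0.0) + weight
    # The role of f(): map the weighted sum back to a discrete category.
    return max(votes, key=votes.get)

# Toy data: two clusters standing in for two digit classes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.05, 0.05]), X, y, k=3))
```

In a trained variant, the fixed inverse-distance weights would be replaced by learned ones, optimized against an error function as the comment suggests.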

Haven't given it a lot of thought, but isn't his vector field idea somewhat similar in approach to neural ordinary differential equations?

