But not a lot of discussion over there.
The visualizations are great, and this basically blew my mind. I didn’t know of the manifold hypothesis until now.
The manifold hypothesis is that natural data forms lower-dimensional
manifolds in its embedding space. There are both theoretical and
experimental reasons to believe this to be true. If you believe this, then
the task of a classification algorithm is fundamentally to separate a bunch
of tangled manifolds.
This could explain (for some definition of explain) the observed predictive power of relatively small neural networks.
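To make that concrete, here's a toy sketch (mine, not from the article): data whose samples have 100 coordinates each but only one intrinsic degree of freedom. A plain SVD shows the variance collapsing onto a couple of directions, which is the low-dimensional-manifold situation in miniature.

    # Toy illustration of the manifold hypothesis: points on a 1-D manifold
    # (a circle) embedded in a 100-dimensional ambient space.
    import numpy as np

    rng = np.random.default_rng(0)

    n, ambient_dim = 1000, 100
    t = rng.uniform(0, 2 * np.pi, size=n)               # one intrinsic parameter
    circle = np.stack([np.cos(t), np.sin(t)], axis=1)   # 1-D manifold in R^2

    # Embed the circle in R^100 via a random linear map plus a little noise.
    embed = rng.normal(size=(2, ambient_dim))
    X = circle @ embed + 0.01 * rng.normal(size=(n, ambient_dim))

    # PCA via SVD: the circle spans only a 2-D plane, so the first two
    # directions carry essentially all the variance despite the 100-D ambient space.
    X_centered = X - X.mean(axis=0)
    s = np.linalg.svd(X_centered, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    print(explained[:5])   # first two entries dominate, the rest are ~0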
FWIW, unsupervised learning and things like topological data analysis are almost entirely about discovering the actual manifolds (or some hand-wavy topology). It doesn't always work; the data often doesn't cooperate and live on a metric space.
As I harp on at every chance I get, I have a pet hypothesis that there's a very deep corollary here waiting to be proven rigorously. Namely, that we can show there exist adversarial inputs that exploit neural networks because they're incompressible. Furthermore, that these inputs are information-theoretically guaranteed to exploit the neural network (even if there are practical, complexity-theoretic workarounds).
I get the image of a technique that, when applied to humans, allows you to see through political speeches and reveals the eldritch horrors scurrying around us continually.
That isn't what the "manifold hypothesis" is good for. Manifold space is more interesting to think about not in the deconstruction of metric space but in the formation of it: i.e., what higher-dimensional, non-metric (intensive) manifold space, through some complex function (e.g. physics), results in this lower-dimensional, feature-rich entity; and what that function is (or, more interestingly, how it changes the resulting entity-state given various changes in the manifolds).
Which may not be useful for classification, but that's old hat anyway. Personally, I feel that the feature dependence in everything NN is brute force. How features come to be, even abstractly modeled, is the thing.
This is true by definition, regardless of the manifold hypothesis.
In order to define compression you don't need a nontrivial metric or a topological space. You do need those things to even talk about the manifold hypothesis, at least in any interesting way.
If you can't map a metric space (or other distance measure) onto a lower-dimensional manifold, you're basically saying you can't forecast using a geometric representation, or that you've already extracted all the signal from the noise in some other way.
There are interesting problems where you can't do anything with metrics or quasi-metric distance measures, but you can forecast things. Geometry is easier to reason about than Kolmogorov complexity though, hence papers like the above.
Let me give a concrete example: in compression and statistical estimation you can get by fine with KL divergence alone. KL divergence is not a metric and does not define a topology.
One can of course define metrics on the probability space or the space of parameters, but I don't see why that would be necessary?
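A tiny sketch of that point (toy numbers of my own, just to illustrate): KL divergence is asymmetric, so it cannot be a metric, yet it is exactly the quantity you optimize in compression and maximum-likelihood estimation.

    # KL divergence between two discrete distributions, in nats.
    import numpy as np

    def kl_divergence(p, q):
        """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0                       # terms with p_i = 0 contribute nothing
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.9, 0.05, 0.05])
    q = np.array([1/3, 1/3, 1/3])

    print(kl_divergence(p, q))  # ~0.70 nats
    print(kl_divergence(q, p))  # ~0.93 nats -- a different value: KL is not symmetric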
I already said you don't need a metric. But in the presence of an interesting metric or other distance measure, of course the signal fits on a lower-dimensional manifold. Metric spaces and other distance measures give people a lot of tools to reason about data in general, hence virtually all common unsupervised learning algorithms and stuff like topological data analysis.
No, because not all data can actually be represented on a topology. It's nice if you can model your data as a normed vector space, but that isn't possible in general.
The ambient space may be huge, but the data is not spread all over it; it lies on a tiny subset that can be well described by very few parameters. This limited degree of freedom in the data is what makes it easy to learn. A priori, there is no reason why data should have such a property.
I would be interested in your thoughts on how you would map human language understanding to a Euclidean topology.
PS: There is a great introductory article on entropy on the blog that is worth checking out.
Given a new vector that needs to be classified (say, x), it is compared with its nearest neighbors in the data set (let's call them x_1, x_2, x_3, ..., x_k). A weighted average of the categories of the nearest neighbors is calculated to classify x.
That is to say, the category y that x belongs to is given by some function of the categories of the nearest neighbors (e.g. a weighted sum):
y = f(w_1*y_1 + ... + w_k*y_k)
where f() is a function that converts the continuous weighted sum into one of the integers representing the categories, and y_1, ..., y_k are the categories that x_1, ..., x_k belong to, respectively.
The weights w_1, ..., w_k can be determined by optimizing some error function.
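Here's a minimal, self-contained sketch of that scheme (my own toy version: f() and the weights above are left abstract, so I've substituted inverse-distance weights and an argmax over per-class weight totals, which is one common concrete choice rather than weights optimized against an error function).

    import numpy as np

    def knn_classify(x, X_train, y_train, k=3, eps=1e-9):
        """Classify x by a weighted vote among its k nearest neighbors."""
        dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
        nearest = np.argsort(dists)[:k]               # indices of x_1, ..., x_k
        w = 1.0 / (dists[nearest] + eps)              # weights w_1, ..., w_k

        # Accumulate weight per category, then pick the heaviest one (the role of f()).
        scores = {}
        for w_i, y_i in zip(w, y_train[nearest]):
            scores[y_i] = scores.get(y_i, 0.0) + w_i
        return max(scores, key=scores.get)

    # Tiny usage example with two made-up clusters labeled 0 and 1.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> 0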