
Neural Networks, Manifolds, and Topology - cf
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
======
ot
This is a very interesting perspective on neural networks. Is it novel? I've
never seen a geometric interpretation of NNs.

In particular this:

> (Apparently determining if knots are trivial is NP. This doesn’t bode well
> for neural networks.)

Is there any theoretical research on obstructions to NN learning? Not that it
would change the practice much (for instance, MLE learning of Gaussian
mixtures is NP-hard, but everybody does it anyway), but it could shed some
light on the geometry of the "hard" instances.

EDIT: for example, to get a sense of how ill-conditioned deep NN learning is,
a recent paper [1] shows that if we feed an object-classifying NN an image
with crafted but imperceptible perturbations, the predictions change completely.

[1] [http://arxiv.org/abs/1312.6199](http://arxiv.org/abs/1312.6199)

~~~
ejenk
I think the author has confused NP with NP-hard (or NP-complete). Indeed, it
is extremely unlikely that the unknottedness problem is NP-complete, as it is
already in NP ∩ co-NP (assuming the generalized Riemann hypothesis).
Furthermore, we know that the problem of computing many knot invariants is
"only" BQP-complete, so it's possible that unknottedness is in BQP (i.e.,
efficiently solvable by quantum computers), though I don't know the expert
opinion on this.

~~~
ot
Indeed, I read that as NP-complete. That's _very_ interesting; do you have any
references on it? Even if knots are ruled out, there might be hard topological
decision problems that can be reduced to linear separability under
homeomorphism, maybe in higher dimensions.

------
tlarkworthy
Bloody brilliant. I have longed for an ML methodology that breaks out of the
float-vector representation, because I don't think it describes naturally
occurring problems that well; most ontologies are semi-structured. However,
this really gets to the nub of the problem: real, but solvable, problems are
manifold complexes. I don't think anyone has really built a learner for
complexes, though; persistent homology is a stab in the right direction, but
it's not there yet IMHO.

------
yummyfajitas
I don't like his proof that a 1-2 hidden unit system can't handle the red
circle in the blue torus. Feels too handwavy.

My proof: Suppose the matrix W is singular. Then there exists a nonzero vector
v for which Wv = 0. Find a point x in the red circle, then find a different
point y = x + av (for some scalar a) in the blue torus. Then
Wy + b = W(x + av) + b = Wx + aWv + b = Wx + b. Fin.
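
A quick numeric check of that argument (a made-up example, not from the
article): take a singular W, a point x, and slide along the null direction v;
both points get the exact same pre-activation, so no later layer can separate
them.

```python
import numpy as np

# Hypothetical illustration of the argument above.
W = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second row is a multiple of the first -> singular
b = np.array([0.5, -0.3])
v = np.array([2.0, -1.0])    # null vector: W @ v == 0

x = np.array([0.3, 0.0])     # imagine x lies on the red circle
a = 5.0
y = x + a * v                # slide along v far enough to land in the blue region

# Both points produce identical pre-activations, so they can never be separated
# by anything downstream of this layer.
print(W @ x + b)
print(W @ y + b)             # identical to the line above
```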

------
ArbitraryLimits
Wouldn't this only apply to neural networks as used for classification? I mean
the general paradigm of deforming curves until they're separated by a
hyperplane seems pretty obvious now that I see it in front of me, but what
about neural networks used to approximate continuous functions?

~~~
GrantS
I'll take a stab at this (I'm a decade out from my last machine learning
class, so no guarantees on correctness). The only reason it's fitting a
hyperplane is that one class is being mapped to the continuous value -1.0, the
other class is being mapped to the continuous value 1.0, and there's a
thresholding step at the end (the hyperplane perpendicular to the line onto
which the continuous values are being projected) to determine the class. If
you're doing regression instead of classification, your training data will be
fed in with more output values than just 1.0 and -1.0 and you'll omit the
thresholding at the end, but otherwise the behavior and intuition should be
the same as in the article.
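
A minimal sketch of that distinction (my own toy illustration, with made-up
parameters standing in for a trained net): the network produces a real-valued
output either way; classification just means the targets were +/-1 and a sign
threshold is applied at the end, while regression keeps the raw value.

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_net(X, W1, b1, w2, b2):
    """One tanh hidden layer with a single real-valued output, as in the article."""
    return np.tanh(X @ W1 + b1) @ w2 + b2

# Made-up parameters standing in for a trained network.
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.0

X = rng.normal(size=(5, 2))
raw = tiny_net(X, W1, b1, w2, b2)

labels = np.sign(raw)   # classification: train on targets +/-1, threshold at 0
values = raw            # regression: train on real targets, keep the raw output
```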

------
sushirain
The article is exceptionally clear and well presented. We need more of this
level of writing about machine learning.

For me, the most important point is one the article only touches on at the
end: if you have good features and a good representation, a linear model may
be enough. Real-world data and tasks usually don't require knot untying. In my
opinion, the model (e.g., a neural net) should be kept simple, and the
challenge is in finding the right features and representation.

Coates & Ng (2012) wrote:

> Many algorithms are available to learn deep hierarchies of features from
> unlabeled data, especially images. In many cases, these algorithms involve
> multi-layered networks of features (e.g., neural networks) that are
> sometimes tricky to train and tune and are difficult to scale up to many
> machines effectively. Recently, it has been found that K-means clustering
> can be used as a fast alternative training method. The main advantage of
> this approach is that it is very fast and easily implemented at large scale.
> On the other hand, employing this method in practice is not completely
> trivial: K-means has several limitations, and care must be taken to combine
> the right ingredients to get the system to work well.

[http://www.stanford.edu/~acoates/papers/coatesng_nntot2012.p...](http://www.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf)
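
For a concrete sense of the pipeline the Coates & Ng quote describes, here is
a rough sklearn sketch (my own simplification on the digits dataset, not their
image-patch setup): learn a dictionary of centroids with K-means, encode each
example by its similarity to every centroid, and keep the classifier on top
simple and linear.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Feature learning": K-means centroids act as a dictionary of features.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_tr)

def encode(X):
    # Encode each sample by its (negative) distance to every centroid.
    return -km.transform(X)

# The actual classifier on top can then stay simple and linear.
clf = LogisticRegression(max_iter=2000).fit(encode(X_tr), y_tr)
print("test accuracy:", clf.score(encode(X_te), y_te))
```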

------
gajomi
This is very interesting. I wonder if anyone knows what could be gained, in
terms of separating out different topologies, by allowing a restricted family
of conformal transformations on top of the tanh functions, of the sort that
allows taking the insides out and so forth. Comments?

~~~
ot
Why conformal? That seems a little too rigid.

~~~
gajomi
To get the effect I was referring to, yes, it is too strict a class. I wasn't
thinking very hard about it, but my idea was that conformal transformations
have nice compositional properties, which could be exploited in developing the
method and its analysis.

------
TomAnthony
A bit of an aside: are there any methods for
compressing/collapsing/simplifying a neural network?

What I mean is: imagine you've built and trained a neural network as per the
article; it is hard to ascertain exactly what it is doing. I was wondering
whether there is work in this area, and it occurred to me that a possible
first step would be to collapse the neural network into a simpler but
functionally equivalent structure.

I imagine this is far more difficult than it sounds, but I just wondered.

~~~
xcthulhu
The Universal Approximation Theorem[1] asserts that you only ever need one
hidden layer, which at least guarantees that "an (approximate) simplification
exists".

But I can't say off the top of my head how you'd collapse an ANN with just two
hidden layers into one. It's not obvious how sigmoid functions compose, but I
suppose I should give it more thought...

[1]
[http://en.wikipedia.org/wiki/Universal_approximation_theorem](http://en.wikipedia.org/wiki/Universal_approximation_theorem)

~~~
michaelochurch
Although one hidden layer suffices theoretically, most of the cutting-edge
work involves multilayer networks. Perhaps paradoxically, those become more
"auditable" from what I've heard. (I've spoken to people with real-world
experience of applying ANNs to unsolved problems, but I am not one.) What I've
heard is that, with one layer, you get too much overloading (i.e. different
stuff ending up at the same hidden node) to understand what it is doing, but
that deeper networks are more legible (with appropriate visualization tools).

UAT says that a solution exists, but it doesn't put a limit on the number of
nodes required, so it would have you doing an optimization in a space that is
not just large, but of arbitrary finite dimension. It can be pretty
nonconstructive (in the sense of proving "there exists" without showing how to
find something) insofar as it's often non-trivial to get convergence to a
working solution in reasonable time.

As for how sigmoids compose, think of bell-shaped curves: you can make a
painting out of bell-shaped "points" if allowed arbitrary precision/steepness,
and the difference of two sigmoids can be bell-shaped, e.g.
[http://www.wolframalpha.com/input/?i=plot+y+%3D+1%2F%281%2Be...](http://www.wolframalpha.com/input/?i=plot+y+%3D+1%2F%281%2Be%5E-x%29+-+1%2F%281%2Be%5E-%28x-1%29%29+from+x+%3D+-5.5+to+6.5)
. I don't know how much this means in practice, but it establishes the
possibility.
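
Here is a small numpy rendering of that idea (my own sketch, not from the
parent comment): the difference of two shifted sigmoids is a bump, and a
single hidden layer can line up many such bumps, each weighted by the target
value at its center, to trace out an arbitrary curve. That is the intuition
behind the universal approximation theorem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 5, 400)

# The difference of two shifted sigmoids is a bell-shaped bump
# (essentially the Wolfram Alpha plot linked above).
bump = sigmoid(10 * (x + 0.5)) - sigmoid(10 * (x - 0.5))

# Place many steep bumps side by side and weight each by the value of the
# target function at its center, here sin(x).
centers = np.linspace(-5, 5, 40)
half = (centers[1] - centers[0]) / 2
bumps = sigmoid(20 * (x[:, None] - (centers - half))) \
      - sigmoid(20 * (x[:, None] - (centers + half)))
approx = bumps @ np.sin(centers)

print(np.max(np.abs(approx - np.sin(x))))   # small error, worst near the interval edges
```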

~~~
agibsonccc
With respect to deep learning networks (as well as more traditional ones with
just a weight matrix and bias), we can look to the matrices for this. I will
offer a compressed representation rather than a way of pruning; I explain why
below.

Each single-layer neural network is made up of three parameter blocks: a
weight matrix (the connections), a visible bias, and a hidden bias.

In theory, this can be represented as a flattened array.

This is what I do in deeplearning4j[1] for optimization (note: I'm the author)
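
As a rough numpy sketch of that flattening (my own illustration; the
deeplearning4j code itself is Java and more involved): pack the weight matrix
and the two bias vectors into one flat parameter vector so a generic optimizer
can treat the whole layer as a single point, and unpack it whenever the layer
has to be evaluated.

```python
import numpy as np

n_visible, n_hidden = 784, 256

# The three parameter blocks of a single layer, as described above.
W     = np.zeros((n_visible, n_hidden))   # weight matrix (connections)
b_vis = np.zeros(n_visible)               # visible bias
b_hid = np.zeros(n_hidden)                # hidden bias

def flatten(W, b_vis, b_hid):
    """Pack all parameters into one flat vector for a generic optimizer."""
    return np.concatenate([W.ravel(), b_vis, b_hid])

def unflatten(theta, n_visible, n_hidden):
    """Recover the three blocks from the flat parameter vector."""
    k = n_visible * n_hidden
    W = theta[:k].reshape(n_visible, n_hidden)
    return W, theta[k:k + n_visible], theta[k + n_visible:]

theta = flatten(W, b_vis, b_hid)
assert all((a == b).all() for a, b in zip(unflatten(theta, n_visible, n_hidden),
                                          (W, b_vis, b_hid)))
```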

The problem with pruning neural nets, so to speak, is that this isn't really a
search problem like A*. Both are graphs in a sense, but with backpropagation
each neuron in a neural net gets updated to correct for the error it caused,
rather than being pruned away as in alpha-beta pruning for game-playing AI.

One last thing: because of the way neural nets learn (especially if you stream
data in for training via online/mini-batch learning rather than training all
at once), each neuron tends to learn a different component of an overall
solution, and some will activate more strongly on certain feature vectors when
trained on the full data set.

A solution to this is to set the number of neurons relative to the input size[2].

[1] [https://deeplearning4j.org/](https://deeplearning4j.org/)

[2]
[http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf](http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf)

------
rdtsc
It is always good to reference SVMs
([http://en.wikipedia.org/wiki/Support_vector_machine](http://en.wikipedia.org/wiki/Support_vector_machine))
when talking about neural networks.

The advantage of SVMs is that, after re-mapping the data so it becomes
separable, they also guarantee to separate it as cleanly as possible (the
maximum-margin hyperplane).

Not sure if they can solve all the same types of problems, but in some cases
they are considered a better, more analytic approach.
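
To make the "remap, then separate as cleanly as possible" point concrete, a
small sklearn sketch (my own example, not from the parent comment): an RBF
kernel implicitly remaps two concentric circles into a space where they are
separable, and the SVM then picks the maximum-margin separator there.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel does the re-mapping implicitly; the SVM then finds the
# maximum-margin separating hyperplane in that feature space.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))   # close to 1.0
```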

~~~
reader5000
"after re-mapping the data so it becomes separable" This is sort of the hard
part.

~~~
rdtsc
Not saying it isn't; that is the main bulk of the algorithm. I'm just saying
there is an additional benefit of SVMs at the end.

~~~
fear91
SVM is great, it's often stacked on top of neural networks to do
classification.

But it can't really deal with extracting features, especially from things like
images and sounds. If you fed images ( pixel by pixel ) straight to the SVM
and tell it to classify them, it would do extremely poorly - especially with
big images.

A better thing is to have a deep ( layered ) network that will get pixel data
from the input and pass it through layers that will then extract some useful
features from it - for example, learn to recognize lines.

Then, after this information passes through the whole network, you are
finished with smaller amount of features that are more useful for
classification. Then you an send this information to SVM and it will deal with
it a lot easier than if you tried to feed it straight pixels.
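
A rough sklearn sketch of that pipeline (my own toy example on the digits
dataset rather than real images): train a small network, reuse its hidden-layer
activations as learned features, and hand those to an SVM for the final
classification.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small network whose hidden layer will act as the feature extractor.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

def hidden_features(X):
    # Recompute the (ReLU) hidden-layer activations from the learned weights.
    return np.maximum(0, X @ net.coefs_[0] + net.intercepts_[0])

# The SVM only sees the learned features, not the raw pixels.
svm = SVC(kernel="rbf").fit(hidden_features(X_tr), y_tr)
print("SVM on learned features:", svm.score(hidden_features(X_te), y_te))
```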

The thing is, we don't really have problems with classification algorithms;
SVMs do extremely well. The biggest hurdle is simply extracting features that
can then be fed to well-working classification algorithms.

------
Houshalter
Cascade correlation networks are known to be good at solving the spirals
problem. They are also very fast to train.

------
derf_
> One could learn the vector field at fixed points (just take some fixed
> points from the training set to use as anchors) and interpolate in some
> manner.

One kind of interpolation with nice properties (smooth everywhere, invertible,
can be constructed for any set of pointwise correspondences) is a
diffeomorphism.

Unfortunately, these mathematical objects are infinite-dimensional, which makes
computation with them somewhat involved, even if they are derived from a
low-dimensional set of point correspondences. See, e.g.,
[http://www.cs.unc.edu/Research/MIDAG/pubs/papers/Joshi_TIP_2...](http://www.cs.unc.edu/Research/MIDAG/pubs/papers/Joshi_TIP_2000.pdf)
for one method of constructing them.
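
For the simpler "anchor points plus interpolation" idea from the quoted
passage (leaving aside the full diffeomorphism machinery in the linked paper),
a normalized Gaussian-kernel interpolation of the displacement vectors is one
cheap stand-in; a sketch with made-up anchors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Anchor points taken from the "training set" and the displacement vectors we
# would like the field to take at those points (both made up here).
anchors       = rng.normal(size=(10, 2))
displacements = rng.normal(size=(10, 2)) * 0.1

def vector_field(x, anchors, displacements, sigma=0.5):
    """Interpolate the field at x with normalized Gaussian kernel weights."""
    d2 = np.sum((anchors - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()
    return w @ displacements

x = np.array([0.2, -0.1])
print(vector_field(x, anchors, displacements))   # interpolated displacement at x
```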

------
z3phyr
Off topic: I wish there were a practical way to program for neuromorphic
chips, which are inherently analog. Does anybody have ideas or links regarding
this field?

------
designium
So happy to see a friend on the front page! Go Colah!

~~~
d1plo1d
w00t!

------
aspensmonster
The typeface and layout are beautiful! As is the work itself. I love it!

