

A visual proof that neural nets can compute any function - jbarrow
http://neuralnetworksanddeeplearning.com/chap4.html

======
Tloewald
Can't the same be said for Fourier series, which make no claims to be some
kind of AI? And likewise humble polynomials:

[http://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theor...](http://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

~~~
mtdewcmu
Yes (I am not an expert in neural nets, but that appears to be exactly what this
is saying). If you look at what goes into a neural net and compare it to what
goes into a Fourier transform, it should be obvious that neural nets have even
more than they actually need to do this task.

~~~
j2kun
This statement doesn't make sense to me. A neural network literally _can't_
produce anything besides a continuous function, and the universality theorem
says that there is no continuous function they can't (approximately) produce.

So what could you possibly mean when you say neural networks have "more" than
they need to do something which characterizes exactly what they can and can't
do?

~~~
mtdewcmu
There are more coefficients than necessary. The FT has the minimum; it's
bijective.

------
mooneater
Who cares if they can compute any function. The important question is, can
they _learn_ any function, and can they learn in a way that can generalize?
(And clearly they can for many useful domains).

~~~
akuma73
What about XOR?

From Wikipedia: In 1969 in a famous monograph entitled Perceptrons, Marvin
Minsky and Seymour Papert showed that it was impossible for a single-layer
perceptron network to learn an XOR function.

~~~
jo_
'Single-layer' is an easily overlooked and absolutely essential modifier. Very
few networks these days are single-layer. (Three layers is the minimum for
universal function approximation, given nonlinear activations.) Deep networks
in new research papers these days have, at minimum, perhaps three layers; some
by LeCun et al. go as high as nine or twelve layers, with four to sixteen
layers in breadth.
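
To make the "single-layer" point concrete, here is a minimal sketch (weights
picked by hand rather than learned, purely for illustration) of a two-layer
network with a nonlinear step activation that computes XOR, which no
single-layer perceptron can represent:

      import numpy as np
      
      step = lambda z: (z > 0).astype(float)   # nonlinear activation
      
      def xor_net(x1, x2):
          x = np.array([x1, x2], dtype=float)
          # hidden layer: h[0] fires on OR, h[1] fires on AND
          h = step(np.array([x.sum() - 0.5, x.sum() - 1.5]))
          # output layer: OR and not AND
          return step(np.array([h[0] - h[1] - 0.5]))[0]
      
      for a in (0, 1):
          for b in (0, 1):
              print(a, b, xor_net(a, b))   # XOR truth table: 0, 1, 1, 0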

~~~
szabba
I'm pretty sure linearity was also part of the picture -- no matter how many
layers of neurons you have, as long as they perform linear combinations, they
could all be replaced with a single one (ignoring numeric errors).

~~~
jo_
I mentioned this in my post ("given nonlinear activations"), but you're
correct to point out its importance. The combination of linear functions is
always linear. Full stop. You need at least one non-linear layer to get
reasonable results.
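
As a quick sanity check of that point, here is a small numpy sketch (with
arbitrary made-up weight matrices) showing that stacking linear layers is
equivalent to a single linear layer:

      import numpy as np
      
      rng = np.random.default_rng(0)
      W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
      x = rng.normal(size=3)
      
      two_layers = W2 @ (W1 @ x)       # two "linear" layers, no activation
      one_layer = (W2 @ W1) @ x        # one layer using the product matrix
      print(np.allclose(two_layers, one_layer))   # True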

------
tomp
As mentioned in the article, the formal statement is actually "neural nets can
approximate (arbitrarily well, using the supremum metric) any continuous
function". Under other norms, they can also approximate non-continuous
functions.

~~~
mturmon
You also need the qualifier "...any continuous function, _on a compact set._ "

Once you add all three qualifiers in (approximate/continuous/compact) it
starts to sound more like math and less like a miracle.
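
Written out with all three qualifiers (and a sigmoidal activation sigma), the
statement in the Cybenko form is roughly:

      For every continuous f on a compact set K in R^d and every eps > 0,
      there exist N, a_i, w_i, b_i such that
      
          sup_{x in K} | f(x) - sum_{i=1..N} a_i * sigma(w_i . x + b_i) | < eps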

Incidentally, one thing of great interest is how the number of hidden units
required behaves as a function of the dimensionality of the input domain. In
dramatic language: "Can neural networks get around the curse of
dimensionality?"

The Cybenko proof does not give enlightenment about that question. Andrew
Barron ([http://www.stat.yale.edu/~arb4/](http://www.stat.yale.edu/~arb4/))
had some results that seemed to indicate the dependence was moderate (not
exponential). I'm not aware what the state of the art currently is.

~~~
rcfox
> how does the number of hidden units required behave as a function of
> dimensionality of the input domain

If I recall correctly, a non-linear problem can be solved as a linear problem
if you consider more dimensions. The hidden layer adds dimensions. So it's not
a function of the input domain but of the problem domain, which usually isn't
explicitly known.
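
A tiny illustration of that "more dimensions" idea (with a hand-picked extra
feature, not one a network is guaranteed to learn): XOR isn't linearly
separable in (x1, x2), but it becomes separable once you add x1*x2 as a third
coordinate.

      import numpy as np
      
      X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
      y = np.array([0, 1, 1, 0])                        # XOR labels
      
      lifted = np.column_stack([X, X[:, 0] * X[:, 1]])  # add the x1*x2 feature
      w, b = np.array([1.0, 1.0, -2.0]), -0.5           # a separating hyperplane
      print((lifted @ w + b > 0).astype(int))           # [0 1 1 0], matches y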

~~~
mturmon
The question in the OP concerns approximation of the f(.) in:

    
    
      y = f(x)
    

where the input "x" is d-dimensional (say).

Some problems (i.e., choice of "f") could be easy. Maybe f only depends on one
element of x, for example. There would be no curse of dimensionality in this
case. Same situation if "f" depends only on any fixed number of elements of
"x".

I think this is roughly what you mean by "dimension of problem domain." Fix an
"f", that defines a problem. And you're right, efficient solution of that
fixed problem is important!

My remark (which is also in the last paragraphs of the Cybenko reference cited
in the OP) had to do with _increasingly difficult problems_.

How to get such a sequence of problems? Suppose you take a simple function
like

    
    
      f(x) = exp(-0.5 * dot(x,x))
    

i.e., the Gaussian, and approximate it with a linear superposition of 1-d
sigmoids (as a neural net would).

The question is, is there an explicit dependence on dimensionality, and is
that dependence exponential in d?

And of course, for more general function classes (not just the single Gaussian
function above), is there such a dependence? If it is _not_ exponential, that
would be astonishing, revolutionary.

The reason this setup ("vary the problem size") is interesting is that we
would clearly like to use neural nets for increasingly higher-dimensional
problems (e.g., learn appearance of 32x32 cats, then 64x64 cats, then ...).
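
For the 1-d case (d = 1) you can get a feel for the setup with a short
numerical sketch. The hidden-unit count N, the centers, and the slope below
are arbitrary choices for illustration; the open question above is how the N
needed for a given accuracy scales as d grows.

      import numpy as np
      
      sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
      
      x = np.linspace(-5, 5, 500)
      target = np.exp(-0.5 * x**2)                    # the 1-d Gaussian
      
      N = 20                                          # hidden units
      centers = np.linspace(-5, 5, N)
      H = sigmoid(4.0 * (x[:, None] - centers))       # hidden activations
      a, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output weights only
      
      print(np.abs(H @ a - target).max())             # sup-norm error on the grid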

~~~
mtdewcmu
I think I know the problem you're referring to (despite not having the right
kind of specialist training to actually work in this field). I think that the
fact that random or quasirandom probes are able to explore these high-
dimensional spaces fairly effectively is evidence that the problem is somehow
not exponential and therefore tractable in some fashion. Does that sound
relevant?

~~~
mturmon
Yes, your observation is relevant.

One has to be careful to separate the functional-approximation problem (which
is in the OP) from the statistical-model-selection problem (which is NOT in
the OP).

The statistical-model-selection problem is _easier_ in some ways (you don't
have to approximate the function everywhere, just where you get data; if you
didn't get data somewhere, you don't care what happens there).

It is _harder_ in some ways (all that probabilistic folderol, plus, your data
is noisy).

There are results that give _rates_ for the functional-approximation problem.
By rates, I mean, how good is the approximation versus number of hidden units.
The best work I know of is by Andrew Barron, but that was in the mid/late
1990s. He's like a genius bulldozer, so his papers are tough going. You'll
note that the Cybenko results, like in the OP, do not give rates. This is
obviously a huge difference.

There are also results in the statistical-model-selection problem, of course.
With rates. That's Vladimir Vapnik's big contribution, later carried on by
others. One of the main results is that a model class has an intrinsic
complexity (VC dimension, or other measures), and you only need on the order of
that many random probes to get (close to) _the best model in the class_, no
matter what the dimensionality of the space where the data lives.

This is precisely the point you're making above.

My comment is getting too long, but: notice that the statistical-model-
selection work only speaks about how to get (close to) the best model in the
model class. You then have to make sure your model class is full enough to get
close to the optimal rule (which is chosen by Nature, or whatever).

So to make a full theory, you need both pieces: how to choose a big-enough
model class (functional approx.) and how to select a good-enough member of the
model class (statistical model selection).

Typically there is a total error having one term for each of the two effects,
and therefore a balancing act between making the model class big enough to
approximate anything, vs. making it small so that you can easily select a good
model. If you look at it sideways, this therefore looks like an optimization
problem with a Lagrange multiplier penalizing model complexity. So it's quite
elegant.
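
Schematically (this is my paraphrase of the standard decomposition, not a
precise theorem statement), the two pieces show up as two terms:

      total error  ~  approximation error  +  estimation error
      
      approximation error: how close the best model in the class can get to the true rule
      estimation error:    how far the model you selected from data is from that best-in-class model
      
      practical form:  minimize over the class   error_on_data(model) + lambda * complexity(class)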

------
numlocked
This is also why Neural Nets are susceptible to overfitting and fell out of
vogue in the 90s :) They will merrily fit themselves, very precisely, to your
noisy, wiggly data. Obviously there are ways to combat this, but it seems like
a corollary to their 'universality'.

------
sz4kerto
Feedforward NNs are only useful for a very narrow* set of problems (* ->
compared to 'all' the problems out there).

Recurrent networks are needed for stateful operation, i.e. wherever some kind
of memory is needed (any case where the input is spread across time or the
sequence of the data is important). And learning in recurrent nets is,
unfortunately, still in its very early stages.

~~~
Houshalter
They are universal function approximators, which means they can map any set of
input values to any set of output values. Of course, doing this sometimes
requires rote memorization of every possible input and its output, rather than
generalizing the function with a few parameters.

Adding more layers improves on this and allows you to make functions that
compose multiple smaller functions. The problem with this is the
nonlinearities cause the gradients to explode or vanish after a few layers. So
the amount of computing power required to train them is huge.

Recurrent NNs had the same problem, since they are equivalent to a very deep
feedforward network where every layer is a time step and the weights between
every layer are the same.
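
A schematic numpy sketch of that effect (a fixed pre-activation and a single
repeated weight, just to show the scaling, not an actual training run):
backpropagating through many layers, or through many time steps of an unrolled
RNN, multiplies the gradient by one factor per layer, and that product either
vanishes or explodes.

      import numpy as np
      
      sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
      
      def gradient_factor(w, depth, z=0.5):
          g = 1.0
          s = sigmoid(z)
          for _ in range(depth):
              g *= w * s * (1.0 - s)    # one layer's contribution to the chain rule
          return g
      
      print(gradient_factor(w=1.0, depth=30))   # ~1e-19, vanishes
      print(gradient_factor(w=8.0, depth=30))   # ~1e8, explodes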

But the invention of Long Short Term Memory has made training RNNs practical.
Basically, as I understand it, some connections do not use nonlinearities so
the gradients don't explode or vanish.

------
mostly_harmless
I wrote about this a few weeks ago:

[http://serialprog.blogspot.ca/2014/07/neural-networks-like-c...](http://serialprog.blogspot.ca/2014/07/neural-networks-like-custom-virtual.html)

My point of view was that neurons in neural nets are essentially analogue
logic gates. Given that combinations of logic gates are Turing complete,
combinations of neural net neurons should be also.
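
A minimal sketch of that analogy (hand-picked weights, just for illustration):
a single threshold neuron computes NAND, and NAND gates alone are universal
for Boolean circuits.

      def nand_neuron(x1, x2):
          # weights -2, -2 and bias 3, thresholded at zero
          return 1 if (-2 * x1 - 2 * x2 + 3) > 0 else 0
      
      for a in (0, 1):
          for b in (0, 1):
              print(a, b, nand_neuron(a, b))   # 1, 1, 1, 0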

My writing is not quite as nice or rigorous as the parent post, but the whole
point of the blog is to get better at self-expression and explaining things.

------
bmease
I loved reading that and interacting with the plots. It totally changes the
dynamic of learning when you can interact and play with it in real time.

------
theophrastus
If neural nets can compute any function (as seems neatly proven here), can
they compute a given function in more than one way? If so, then when we apply
novel input (which was our goal in training), how can we know that the
particular way computed from the training set is 'right' for that novel input?
If this is all true, it would seem to make neural nets perfectly unreliable as
a means of modeling..?

------
Dn_Ab
This is a wonderful post. One minor aspect that nags me is that when I read
"any function", I think of any "effectively calculable method". But regular
feedforward MLPs are not Turing complete (will you be going over recurrent or
recursive networks?). Either way, it would be useful to note this distinction,
as I've never seen that point of confusion dealt with properly in one place.

------
mkoryak
For me "visual proof" was a java program I wrote in 2005 that used genetic
algorithms to evolve a neural network checker AI. It beat me every time:

[https://github.com/mkoryak/Evolutionary-Neural-Net-Checker-A...](https://github.com/mkoryak/Evolutionary-Neural-Net-Checker-AI)

------
coherentpony
Computing a function and approximating a function are two different
disciplines.

When approximating a function, one must also talk about the _sense_ in which
the approximation is being made. That is, L^2? H^1? Pointwise?
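
For reference, two of the most common senses, written informally:

      sup (uniform) norm:  ||f - g||_inf = sup_x |f(x) - g(x)|
      L^2 norm:            ||f - g||_2   = ( integral |f(x) - g(x)|^2 dx )^(1/2)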

------
pesenti
Does anybody know if this is true for other machine learning techniques?

~~~
mturmon
The same arguments as in the original Cybenko paper, or the Stone-Weierstrass
theorem, lend support to the idea that SVMs are universal approximators (with
most typical kernels). This has been proven by a couple of authors.

I'm not aware of universal approximation results for random forests, but since
they have the same general construction, this would not be surprising.

------
jokoon
Who's to doubt that neural networks are completely awesome?

Any more news about those chips that were optimized for neural networks? Was it
IBM or Samsung?

~~~
j2kun
We have very few theoretical results about neural networks. So not
_completely_ awesome.

~~~
jokoon
I mean there is a big future for neural networks

~~~
j2kun
Well you asked a question and I answered it honestly :)

~~~
jokoon
> So not completely awesome.

I don't think I understand your point. By awesome I meant it's a very
exciting/interesting field of research. I don't mean to call thermodynamics
awesome because it powers cars or because it's a well-researched field.

Trying to understand how the brain works, or making a computer that works in a
similar way, is awesome to me. If there are still computational limitations
that make it impractical, that's still an awesomely interesting subject.

------
signa11
May I also humbly suggest the two-volume series called "Parallel Distributed
Processing" (Rumelhart et al.), which provides an excellent overview of _early_
NN research.

------
mikeash
I liked this article a lot, but I found it extremely confusing how some of the
diagrams were interactive and some weren't. Why not make them all interactive?
Barring that, an obvious visual indicator when one is interactive would be
handy. As it was, I clicked on a lot of static images.

~~~
lucio
[https://www.youtube.com/watch?v=KpUNA2nutbk](https://www.youtube.com/watch?v=KpUNA2nutbk)

