So what could you possibly mean when you say neural networks have "more" than they need to do something? How does that characterize exactly what they can and can't do?
Seems to me that jumping to whether they can "learn any function", as you asked, should come only after that is first established.
In 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function.
A book came out a little while ago which covers basic neural networks and culminates in showing a network that can approximate XOR. I strongly recommend reading not only that chapter, but the whole book.
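For anyone curious to see it concretely, here's a minimal sketch in plain numpy with hand-picked (not learned) weights; the network shape and the specific numbers are just illustrative, not taken from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights for a 2-2-1 network approximating XOR.
# Hidden unit 1 acts like OR, hidden unit 2 like AND;
# the output unit computes roughly "OR AND NOT AND", i.e. XOR.
W1 = np.array([[20.0, 20.0],    # OR-like unit
               [20.0, 20.0]])   # AND-like unit
b1 = np.array([-10.0, -30.0])
w2 = np.array([20.0, -20.0])
b2 = -10.0

def xor_net(x):
    h = sigmoid(W1 @ x + b1)
    return sigmoid(w2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(float(xor_net(np.array(x, dtype=float))), 3))
# Prints approximately 0, 1, 1, 0 -- the XOR truth table.
```

Nothing deep, but it makes the Minsky/Papert point tangible: the single extra layer is exactly what buys you XOR.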
Minsky and Papert hate him! Use this one weird trick ….
(Sorry; I couldn't resist!)
The initial flurry of research activity slowed to a background simmer, however. As a steady march of implementation problems was knocked down, it became clear that the limitations of neural nets were inherent to neural nets themselves, rather than to any particular implementation detail. For example, the dual problems solved by SVMs are provably convex, in contrast to the lack of any convexity guarantee for neural nets, which happily gradient-descend into local minima. There were workarounds, of course, with regularization techniques, stochastic descent, etc. providing some measure of relief, but a migration of research interest away from neural nets seemed increasingly sensible, and today that migration seems largely complete.
What are you talking about? Deep learning is one of the hottest areas of research today, and a lot of it has to do with neural networks. NNs are the state of the art in several domains. Case in point: http://image-net.org/challenges/LSVRC/2014/results. All of the top entries use convolutional networks; in fact, almost all of the entries do.
The fact that the loss function represented by a neural network can be highly nonconvex is what makes them so effective in the domains in which they are used. See this presentation by Yann LeCun for more info: http://www.cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.p...
"ML theory has essentially never moved beyond convex models, the same way control theory has not really moved beyond linear systems. Often, the price we pay for insisting on convexity is an unbearable increase in the size of the model, or the scaling properties of the optimization algorithm ... This is not by choice: nonconvex models simply work better.
Have you tried acoustic modeling in speech with a convex loss? ... To learn hierarchical representations (low-level features, mid- level representations, high-level concepts....), we need “deep architectures”. These inevitably lead to non-convex loss functions."
This isn't to say that NNs are going to solve all our problems, but to say that there has been a shift in interest away from NNs is absurd.
The next quantum leap is expected with the introduction of more specialized hardware such as neuromorphic chips: http://www.technologyreview.com/view/428235/intel-reveals-ne...
TLDR: The authors create adversarial examples, i.e., they slightly modify the original images so that they look exactly the same to humans, but neural networks can no longer classify the images correctly. What does that imply about the ability to generalize? :)
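(For a sense of how such perturbations are built: one later, widely cited recipe is the fast gradient sign method. The sketch below is a generic PyTorch version with `model` and `loss_fn` as placeholders, not code from the paper being summarized.)

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, eps=0.01):
    # model, loss_fn, x, y are placeholders: any differentiable classifier,
    # a loss such as cross-entropy, an input image tensor, and its true label.
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Nudge every pixel by +/- eps in the direction that increases the loss.
    # For small eps the result looks identical to a human, yet it can flip
    # the network's prediction.
    return (x + eps * x.grad.sign()).detach()
```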
On a more general note: NNs are always treated as something magical. I think a "sober view" is that NNs are a special way to parameterize non-linear functions. This is neither good nor bad, but it's easy to see when you look at, for example, a 2-layer NN:
$f(x) = w_2^T \sigma(W_1 \sigma (W_0 x))$
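Written out as code, that parameterization is nothing more than a couple of matrix multiplications with an elementwise nonlinearity in between (a minimal numpy sketch with made-up layer sizes, biases omitted to match the formula above):

```python
import numpy as np

def sigma(z):
    # Elementwise nonlinearity; tanh is one common choice.
    return np.tanh(z)

rng = np.random.default_rng(0)
d_in, d_h1, d_h2 = 10, 32, 32            # made-up sizes
W0 = rng.standard_normal((d_h1, d_in))
W1 = rng.standard_normal((d_h2, d_h1))
w2 = rng.standard_normal(d_h2)

def f(x):
    # f(x) = w2^T sigma(W1 sigma(W0 x)): just a parameterized non-linear function
    return w2 @ sigma(W1 @ sigma(W0 @ x))

print(f(rng.standard_normal(d_in)))
```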
Usually that means that the model is overfitting, and that's actually a challenge for any machine learning model.
What matters is not only whether they can learn it but how much data they need to learn it to a given degree of accuracy. This is the kind of question addressed by nonparametric statistics and statistical learning theory.
Comparing neural net neurons to logic gates in a computer circuit or FPGA gives a nice intuitive understanding of why the training works in the first place. The training essentially sets up a custom virtual circuit for whatever function you need.
Given this property, it seems intuitive that they would be able to learn any function, since any function is representable as logic gates.
NNs themselves are sort of agnostic to the algorithm you use to set their weights. It's sort of like saying "computers are Turing complete, but can we ever prove that human programmers are smart enough to program any function."
> "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". This definition is notable for defining machine learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's proposal in his paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"
It's important, because we don't need to care about cognition or consciousness, and we can still write programs that solve problems well by making inferences from patterns in data.
Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.
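For reference, the statement being gestured at (Cybenko's theorem, paraphrased from memory) is: for any continuous $f$ on a compact set $K \subset \mathbb{R}^d$, any sigmoidal $\sigma$, and any $\varepsilon > 0$, there exist $N$, vectors $w_j \in \mathbb{R}^d$, and scalars $\alpha_j, b_j$ such that
$$\sup_{x \in K} \Big| f(x) - \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^\top x + b_j) \Big| < \varepsilon.$$
Approximation only, continuous targets only, compact sets only.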
Incidentally, one thing of great interest is, how does the number of hidden units required behave as a function of dimensionality of the input domain. In dramatic language, "Can neural networks get around the curse of dimensionality?"
The Cybenko proof does not give enlightenment about that question. Andrew Barron (http://www.stat.yale.edu/~arb4/) had some results that seemed to indicate the dependence was moderate (not exponential). I'm not aware what the state of the art currently is.
I agree. The problem with attempting to make subtle technical points digestible by untrained people is that the larger meaning of the exercise is likely to be lost along with the details. This would probably be a good demonstration to show to investors in your NN-based startup, because it seems impressive. But if I had to state the take-home message of this that the average person should care about, I'm not sure there is one. There are simpler systems with the same property that are taught to undergraduate engineers (and NNs don't represent a novel path to having it, even); there's a great distance between approximating functions and solving practical ML problems.
Personally, I would have been more interested had this been an argument about back propagation for training NNs. I suspected NNs could do this because of what's in them, but training is just as much of a constraint on what they can actually do.
Later on in the article, it's qualified, but tempers are already rising.
Here come people with their non-computable functions, their unmeasurable functions, their nowhere-continuous functions, all wanting to get approximated. In sup norm! On unbounded sets!
Irksome, greedy, and vexatious.
If you mean learn functions efficiently with few parameters, or in ways that generalize well, the article makes no claims about that at all. There are two "no free lunch" theorems: one says that it's impossible to guarantee any method will generalize well on all problems, and the other says it's impossible to guarantee any optimization method will work well on all problems. You just have to make strong assumptions like "my function can be modeled by a neural net" or "the error function is convex."
However, neural networks are agnostic to the optimization algorithm you use to set their weights. There are many different ones.
If I recall correctly, a non-linear problem can be solved as a linear problem if you consider more dimensions. The hidden layers add dimensions. So, it's not a function of the input domain but of the problem domain, which usually isn't explicitly known.
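A tiny illustration of that idea (the standard textbook one, not something from the thread): XOR isn't linearly separable in $(x_1, x_2)$, but it becomes separable once you add the product feature $x_1 x_2$ as a third dimension.

```python
import numpy as np

# XOR points lifted from 2D to 3D by adding the product feature x1*x2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
Z = np.column_stack([X, X[:, 0] * X[:, 1]])   # (x1, x2, x1*x2)

# In the lifted space the linear rule  x1 + x2 - 2*(x1*x2) > 0.5
# separates the classes exactly, even though no line works in 2D.
scores = Z @ np.array([1.0, 1.0, -2.0]) - 0.5
print((scores > 0).astype(int))   # [0 1 1 0], matching y
```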
y = f(x)
Some problems (i.e., choice of "f") could be easy. Maybe f only depends on one element of x, for example. There would be no curse of dimensionality in this case. Same situation if "f" depends only on any fixed number of elements of "x".
I think this is roughly what you mean by "dimension of problem domain." Fix an "f", that defines a problem. And you're right, efficient solution of that fixed problem is important!
My remark (which is also in the last paragraphs of the Cybenko reference cited in the OP) had to do with increasingly difficult problems.
How to get such a sequence of problems? Suppose you take a simple function like
f(x) = exp(-0.5 * dot(x,x))
The question is, is there an explicit dependence of the number of hidden units required on the dimensionality, and is that dependence exponential in d?
And of course, for more general function classes (not just the single Gaussian function above), is there such a dependence? If it is not exponential, that would be astonishing, revolutionary.
The reason this setup ("vary the problem size") is interesting is that we would clearly like to use neural nets for increasingly higher-dimensional problems (e.g., learn appearance of 32x32 cats, then 64x64 cats, then ...).
One has to be careful to separate the functional-approximation problem (which is in the OP) from the statistical-model-selection problem (which is NOT in the OP).
The statistical-model-selection problem is easier in some ways (you don't have to approximate the function everywhere, just where you get data; if you didn't get data somewhere, you don't care what happens there).
It is harder in some ways (all that probabilistic folderol, plus, your data is noisy).
There are results that give rates for the functional-approximation problem. By rates, I mean, how good is the approximation versus number of hidden units. The best work I know of is by Andrew Barron, but that was in the mid/late 1990s. He's like a genius bulldozer, so his papers are tough going. You'll note that the Cybenko results, like in the OP, do not give rates. This is obviously a huge difference.
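From memory, the flavor of Barron's result: for targets whose Fourier transform has a finite first moment $C_f = \int_{\mathbb{R}^d} \|\omega\| \, |\hat f(\omega)| \, d\omega$, a single hidden layer of $n$ sigmoidal units achieves squared $L^2$ error on the order of $C_f^2 / n$. There is no explicit exponential dependence on the dimension $d$, though the dimension can still creep back in through $C_f$ itself. (That's my recollection of the mid-90s papers; check the originals before quoting it anywhere.)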
There are also results in the statistical-model-selection problem, of course. With rates. That's Vladimir Vapnik's big contribution, later carried on by others. One of the main results is that a model class has an intrinsic complexity (VC dimension, or other measures) and you only need to have order-of that many random probes to get (close to) the best model in the class. No matter what the dimensionality of the space where the data lives.
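Roughly, the classical bound (again paraphrased from memory) says that with probability at least $1 - \delta$, uniformly over a class $\mathcal{F}$ of VC dimension $d_{VC}$,
$$R(g) \;\le\; \hat R_n(g) + O\!\left( \sqrt{ \frac{d_{VC} \log(n / d_{VC}) + \log(1/\delta)}{n} } \right),$$
where $R$ is the true risk and $\hat R_n$ the empirical risk on $n$ samples; the dimension of the input space never appears, only the complexity of the class.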
This is precisely the point you're making above.
My comment is getting too long, but: notice that the statistical-model-selection work only speaks about how to get (close to) the best model in the model class. You then have to make sure your model class is full enough to get close to the optimal rule (which is chosen by Nature, or whatever).
So to make a full theory, you need both pieces: how to choose a big-enough model class (functional approx.) and how to select a good-enough member of the model class (statistical model selection).
Typically there is a total error having one term for each of the two effects, and therefore a balancing act between making the model class big enough to approximate anything, vs. making it small so that you can easily select a good model. If you look at it sideways, this therefore looks like an optimization problem with a Lagrange multiplier penalizing model complexity. So it's quite elegant.
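In schematic form (a rough picture, not a precise theorem statement):
$$\text{total error} \;\approx\; \underbrace{\inf_{g \in \mathcal{F}} \operatorname{err}(g)}_{\text{approximation: class too small}} \;+\; \underbrace{\operatorname{est}(\mathcal{F}, n)}_{\text{estimation: class too rich for } n \text{ samples}},$$
and model selection amounts to minimizing empirical error plus $\lambda \cdot \operatorname{complexity}(\mathcal{F})$ for some multiplier $\lambda$.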
"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated." - Cover, Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition., 1965
From the last paragraph of the paper by George Cybenko referenced in the OP:
"While the approximating properties we have described are quite powerful, we have focused only on existence. The important questions that remain to be answered deal with ... how many terms in the summation (or equivalently, how many neural nodes) are required to yield an approximation of a given quality? [...] We suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms. This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics."
[I'm highly familiar with both the paper by Cybenko, and the paper by Tom Cover on linear separability that is the source of the Wikipedia quote pasted above, having gone through them carefully as part of my PhD. Andrew Barron, mentioned above, was Tom Cover's student, and the work Andrew did can be viewed as another approach to this problem.]
that applies to all of math. There are no miracles.
> that applies to all of math. There are no miracles.
Err, what applies to all of math? That it sounds like math?
I do not agree that there are no miracles—plenty of mathematics is miraculous. As a first example off the top of my head: the areas of squares that can be drawn with vertices on the integer lattice in the plane are exactly the integers in whose prime factorisation each prime `p` for which `p + 1` is divisible by 4 appears an even number of times. This is a miraculous relationship between geometry and number theory, and it's just the tip of an iceberg!
I find the idea of a neural net finding a limitation of a neural net, to be interesting.
Your brain is not a neural net.
If we take the phrase neural network to mean, "a series of deeply and widely interconnected elements which assist or inhibit the transmission of signals to one another, activated by the inputs exceeding a threshold," then I think there's a pretty good amount of overlap.
I understand there's a temporal signaling model (pulses at varying rates, not steady-state signals) and stochastic information (like random firing and a bunch of noise), but once we abstract out the axons, dendrites, and neurochemicals, is there another piece of functional equipment which drastically affects things? How does our simplified view, in which small, individually stupid pieces act in concert to produce complex behavior, differ from the real brain?
I'm not a neurobiologist (if you are please correct/clarify!), but here are just a few of doubtless many significant differences as I understand them:
* Real neurons don't update in a bunch of coordinated discrete timesteps, as an ANN learning algorithm does. They can fire independently and in continuous time.
* Real neurons' activation behaviour is closer to sudden delta-function-like spikes, rather than a smooth activation function or something which allows them to stay in a firing state for longer than a short pulse.
* The structure of the connection graph of neurons in the brain is incredibly more complex than that of artificial neural net models, can change over time, isn't split into obvious layers from input to output, there isn't a clearly-identifiable error signal which is used to train it at the outputs, ...
* A real neuron is an incredibly complex biological system whose behaviour can be modulated in all sorts of ways and which I expect would take a nightmarishly complicated set of PDEs to model with any degree of realism even as an isolated unit. To think we've captured all aspects of their behaviour relevant to human cognition with a simple weighted sum and a sigmoid (say) seems pretty naive.
Details aside though, the idea that you can start drawing deep philosophical conclusions about the nature of human thought, our ability to conceive of our own limitations etc, based on an analogy between a complex biological system and a simple mathematical formalism which is at best loosely influenced by certain limited aspects of it -- it's just silly, one of those "not even wrong" sort of statements.
I think it's worth pointing out that when computer science researchers play with different kinds of neural nets which expand beyond the simplistic version used in the OP, they still call them neural nets. I have seen abstract computer-driven neural nets that fire independently and in (more) continuous time, are plastic, lack clear layers, etc.
I think when you said "but I'm talking about a neural net as a specific mathematical model here", that is fair: you are asserting that while this model is similar to a biological one at some very primitive level, the model of neural net used here is so far removed from our biological implementation as to render any comparison non-meaningful. I believe that is possible, but we still haven't discussed evidence that this is the case.
I thought it might not be a "silly" question: consider that a Turing machine simulated with pencil and paper, given enough paper and time, is capable of doing any calculation your desktop computer can. Now it may take 10,000+ years and many "operators'" lives, but that is beside the point.
I was thinking that the comparison between the pencil-and-paper Turing machine and the desktop machine might be similar to the comparison between this very basic neural net model and our own much more complex biological instance. The idea is that the two systems, while varying vastly in efficiency, might be equivalent in what they can theoretically compute.
I don't see how asking how much we can simplify our biological neural net and still have it be computationally equivalent to the original (even if less efficient) is a silly question.
(I don't like "silly"; it implies the question is not worth asking. Calling your student's question silly is a good way to make them never ask another.)
To put it simply, it is possible for ANNs to fail at something and for us to succeed at the same thing.
You could say that a neural network found this limitation of neural networks, to the extent that neural networks could be defined in terms of mathematical logic. However, it's not guaranteed that the neural networks in our brain can be explained in these terms--the physical processes underlying them are not fully understood.
This is of course true, but nevertheless it is an interesting result (which is why we study Gödel's incompleteness theorem).
Recurrent networks are needed for stateful operation, i.e. where some kind of memory is needed (in any case where the input is spread across time or the sequence of data is important). And learning in recurrent nets is unfortunately still at a very early stage.
Adding more layers improves on this and allows you to build functions that compose multiple smaller functions. The problem is that the nonlinearities cause the gradients to explode or vanish after a few layers, so the amount of computing power required to train them is huge.
Recurrent NNs had the same problem, since they are equivalent to a very deep feed-forward network where every layer is a time step and the weights between every layer are the same.
But the invention of Long Short-Term Memory (LSTM) has made training RNNs practical. Basically, as I understand it, some connections do not pass through nonlinearities, so the gradients don't explode or vanish.
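A toy numerical illustration of the vanishing half of that story (a scalar chain, nothing like a real framework's computation): the gradient through a stack of sigmoid layers is a product of per-layer factors, each at most $0.25 \cdot |w|$ for the sigmoid, so it shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 30
w = rng.standard_normal(depth) * 0.9   # toy scalar "weights", one per layer

# Forward pass through a scalar chain: a_{k+1} = sigmoid(w_k * a_k)
a = [rng.standard_normal()]
for k in range(depth):
    a.append(sigmoid(w[k] * a[-1]))

# Backward pass: d a_depth / d a_0 is a product of per-layer derivatives.
grad = 1.0
for k in range(depth):
    s = a[k + 1]                  # sigmoid output at layer k
    grad *= s * (1.0 - s) * w[k]  # sigmoid'(z) = s*(1-s), times the weight
print(f"|d a_{depth} / d a_0| = {abs(grad):.2e}")
# The magnitude collapses toward zero after a handful of layers: the
# vanishing gradient that made deep (and recurrent) nets hard to train.
```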
My point of view was that neurons in neural nets are essentially analogue logic gates. Given that combinations of logic gates are Turing complete, combinations of neural net neurons should be also.
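A toy version of that view (hand-picked numbers, purely illustrative): a single steep sigmoid unit behaves like a NAND gate, and NAND on its own is enough to build any Boolean circuit (combinational logic at least; full Turing completeness still needs memory and unbounded storage).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nand_neuron(x1, x2):
    # Weights -20, -20 and bias +30: output is ~1 unless both inputs are 1.
    return sigmoid(-20.0 * x1 - 20.0 * x2 + 30.0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(nand_neuron(*x), 3))
# ~1, ~1, ~1, ~0: a (soft) NAND gate, which is functionally complete.
```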
My writing is not quite as nice or rigorous as the parent post, but the whole point of the blog is to get better at self-expression and explaining things.
When approximating a function, one must also talk about the sense in which the approximation is being made. That is, L^2? H^1? Pointwise?
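For concreteness (standard definitions, nothing specific to the article):
$$\|f - g\|_\infty = \sup_{x \in K} |f(x) - g(x)|, \qquad \|f - g\|_{L^2} = \Big( \int_K |f(x) - g(x)|^2 \, dx \Big)^{1/2}.$$
An approximation can be excellent in $L^2$ while being badly wrong on a small set, which the sup norm would not forgive; the result in the OP is in the sup norm on a compact set.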
I'm not aware of universal approximation results for random forests, but since they have the same general construction, this would not be surprising.
Any more news about those chips that were optimized for neural networks? Was it IBM or Samsung?
I don't think I understand your point. By "awesome" I meant it's a very exciting/interesting field of research. I don't mean to call thermodynamics awesome because it powers cars or because it's a well-researched field.
Trying to understand how the brain works or making a computer that works in a similar way is awesome to me. If there are still computational limitations that makes it not practical, that's still an awesomely interesting subject.