Until last week, it was an open problem whether every function with n inputs that is computable in nondeterministic time 2^O(n) could also be computed by a two-layer circuit using only O(n) gates (that is, a deep net with just one hidden layer and a linear number of gates). This is an embarrassing state of knowledge.
Now we know the following slightly less embarrassing thing: there is an explicit function, computable in linear time, that needs at least n^(3/2) gates in a two-layer deep net (modulo some log(n) factors), and another function that needs n^(3/2) gates in a three-layer deep net (with the additional restriction that the output gate is a majority vote of the previous layer).
This is still a long way away from truly understanding the representational power of deep nets, but it's the first general progress that has been made since the early 90's.
Disclaimer: I reinvented the idea about a month or two ago, and ran into the existing papers when googling for joint entropy estimators after making some interesting graphs with hierarchical probabilistic programs. I only wish I could've been the first!
Shiro Usui, Shigeki Nakauchi, and Masae Nakano, "Reconstruction of Munsell color space by a five-layer neural network," J. Opt. Soc. Am. A 9, 516-520 (1992)
The whole idea of "simplest model that explains most of the data" has always been very appealing to me. The concept is closely tied to reproducing kernel Hilbert spaces, which have recently experienced a revival in interest due to the representer theorem.
Why? Is that a standard dataset? They could have just cherrypicked tasks.
I also noticed that they seem to only have the actual computation working for discrete random variables (integer features) right now, which limits its applicability. They also seem to use mass-function estimators, which again can work well for discrete data while becoming statistically intractable when dealing with continuous random variables.
>The whole idea of "simplest model that explains most of the data" has always been very appealing to me.
In this case it's a bit more like, "The set of latent variables that best screen off the observables from each other."
this happens all the time. i'm thankful that you took the time to post the results, because a discovery that's not communicated is not really a discovery. indeed, i suspect that what's going on in the field is as much a lack of communication as a lack of understanding.
i wish the author of the original article had included links to at least a couple of the "number of tools to probe the geometry of these hidden structures."
Only commenting on that portion, as it's what I know. The author doesn't appear to give any indication of which test was used, or where the sample was taken from. This makes me a little suspicious, but it's probably just a disciplinary thing.
Additionally, many tests reduce the pool of available questions for a scale after testing it against other scales + theory. This means that what the author is recovering is the scale imposed on the items by the original authors.
Nonetheless, this has forced me to reinvestigate the Big5 model, so it's not all bad :).
Massive appreciation for the links.
Probably not as theoretical as the work you referenced, but interesting to me because of the deeply practical outcomes in NLP.
Exactly. Someone should tell Yann LeCun this --see e.g. Section 3.2 in , or pretty much every time he brings up circuit complexity theory to automagically imply that "most functions representable compactly with a deep architecture would require a very large number of components if represented with a shallow one." [ibid, p.14]
Not sure if it tells us more about why the network works, but they sure as heck get rid of a lot of connections.
Most neural network models assume that all neurons in one level are fully connected to all neurons in the next level. This leads to confusion about how ANNs work.
From my research I'd argue that MOST interactions in these networks are spurious. Once you remove them it reveals the (visual) topology (circuit diagram) that's driving the function of that network.
In the paper I wrote, I evolved gene regulatory networks (which have the same mathematical representation as ANNs) such that interactions between any two nodes could be deleted, created (if W_ij = 0), or modified according to probabilities of deletion, creation, and modification. Given these probabilities, you can calculate the number of interactions that should result when the network reaches equilibrium; however, what I found was that the network evolves fewer interactions than you would expect from the equilibrium calculation. This says that, all things being equal, a network is paying a price for spurious interactions, and that these will be removed in an evolutionary environment. Basically, each interaction needs to pay its way, otherwise it leads to unnecessary complexity that reduces the fitness of the network.
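To make the equilibrium baseline concrete, here's a minimal sketch of the neutral (no-selection) case; the node count and probabilities are made-up numbers, not values from the paper:

    # Each possible interaction flips on with p_create (if absent) and off
    # with p_delete (if present) per generation; with no selection, the
    # stationary fraction present is p_create / (p_create + p_delete).
    import random

    N = 50                                # nodes -> N*N possible interactions
    p_create, p_delete = 0.01, 0.04

    present = [[False] * N for _ in range(N)]
    for generation in range(2000):
        for i in range(N):
            for j in range(N):
                if present[i][j]:
                    present[i][j] = random.random() >= p_delete
                else:
                    present[i][j] = random.random() < p_create

    observed = sum(row.count(True) for row in present)
    expected = N * N * p_create / (p_create + p_delete)
    print(f"observed {observed}, neutral equilibrium {expected:.0f}")
    # The claim above is that under selection the evolved networks end up
    # *below* this neutral expectation: spurious interactions cost fitness.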
Let's say a specific problem has only one very specific set of connections that matters. You'll eventually end up with weights that reflect that, but that doesn't prevent a lot of other connections from having weights set during training that may end up being cancelled out or reduced enough to have no meaningful impact whatsoever on the end result.
While some might wave their hands and call this an eli5, it was quite well done, and between that and some xkcd comics I was able to learn enough to understand how orbits work and a few other awesome facts.
I am interested in finding something similar for machine learning. I am not embarrassed to admit that "Super-Linear Gate and Super-Quadratic Wire Lower Bounds for Depth-Two and Depth-Three Threshold Circuits" is just way too complex for my casual interest, but I would like to know more about how AI/ML works conceptually. Clearly the discipline underpins a bunch of services I use regularly, and as computing and techniques improve it will start playing a more conspicuous role.
However, there are many papers that explore various ways to make a network learn, and they keep improving on performance, suggesting they're on to something. There are also many papers that discuss possible theoretical implications of experimental results.
But what does Knowing Why The Network Works mean exactly? "It works because universal approximation and gradient descent", but that's not a very satisfactory answer. "It works because it starts at a general solution and, over the course of many iterations, takes many small steps in an ever changing direction defined by a gradient approximation generated by looking at the difference between an average error and a target output (which should trend towards 0)".
What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?
Personally, I think the whole thing is a red herring -- people in the field have some idea of how neural nets work, and there are many disciplines considered by many to be mature sciences that are far from settled on a grand theoretical scale.
That said, the theory I'm most interested in is recent attempts to connect a memory module to neural networks so they can "learn" to store important/complex/distributed information that can be recalled with high accuracy later. That will make it easier to do things like ask a neural network to remember your name, or where you left your keys, or whatever.
To me this is obvious: a proof is an adequate answer. All of the explanations given in this area are heuristic reasons. They're nice, but we can't know if they're correct or if we're being fooled by our intuition without a proof.
What you get instead of local minima are saddle points, where some dimensions are curving up and some are curving down. Saddle points can also be problematic for optimization but they can be dealt with using fancy optimization techniques. For a more rigorous explanation, see http://arxiv.org/abs/1406.2572
ELI5 version: One can ask, what's the probability that a random person drawn from Earth's population (points in high dimensions) owns a Bugatti (is a local minimum)? It's very small, obviously. But that doesn't tell us anything about the probability of Bugatti ownership among select subsets of people (critical points).
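A toy illustration of the saddle-point picture (my own example, not from the paper): plain gradient descent on f(x, y) = x^2 - y^2, whose origin curves up along x and down along y.

    # Starting exactly on the x-axis, descent crawls into the saddle at the
    # origin and stalls; any tiny y-perturbation eventually escapes downhill.
    def grad(x, y):
        return 2 * x, -2 * y          # gradient of f(x, y) = x^2 - y^2

    for y0 in (0.0, 1e-6):            # on-axis start vs. tiny perturbation
        x, y, lr = 1.0, y0, 0.1
        for _ in range(200):
            gx, gy = grad(x, y)
            x, y = x - lr * gx, y - lr * gy
        print(f"start y={y0:g}: end near ({x:.2g}, {y:.2g})")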
An example I can think of would be an absurd million input neural network, where one of the inputs only has a pronounced effect on one of the outputs. It seems like it would be possible for the path of the input to output to be dragged downhill in the context of all outputs, but uphill in the context of the single output it affects.
Is what I've described not likely, or am I just completely off base?
If the intervals spanned by x,y,z do not overlap, then your statement does not hold unless local minima are evenly distributed over the entire region, which defeats the purpose of searching for minima in the first place. If the intervals spanned by x,y,z are identical, then gradient descent can be reduced to just searching over f(x) by symmetry. You are right that there are m^n minima, but they are found in (nm)^1 steps. If the intervals spanned by x,y,z partially overlap, you can further reduce the search space to just f(x) and the union of the intervals.
I don't see how your example shows that gradient descent is likely to fail for any classifier with a sufficiently high number of independent parameters. Am I missing something?
Before deep learning happened, neural networks used to be popular for regression and interpolation problems. For a long time, our understanding of how they worked wasn't much better than our current understanding of how deep learning works.
In 1994, Radford Neal showed that weight-decay neural networks were in fact an approximation to Bayesian inference on a Gaussian process prior over the space of possible functions. Amongst other benefits, this allowed better approximations to be utilised, and essentially (along with the advent of SVMs) marked the end of the first neural network era.
Something like that is what I would consider a satisfactory "why"
I would say it means: Why do these algorithms seem to be doing "well" despite their well-known theoretical intractability? Is it something about the instances? Or is it simply a matter of scale (so the local optima they are getting stuck in are not as obvious)? Or is there something "deep" we do not understand yet about these algorithms (or the theory)?
Either way, there is something going on here for sure.
You can download open source SAT solvers today that work spectacularly well on "real" SAT instances with millions and millions of variables and clauses. Yes, SAT is the quintessential NP-complete problem and in fact it is pretty easy to come up with a SAT instance with only a few tens of variables/clauses that would kill these solvers. But somehow these "hard" instances almost never occur in practical problems generated in hardware/software verification/synthesis as well as ton of other applications (planning, constraint programming etc.)
So there must be some characteristic of the problems that we generate in practice that makes them "easy", but we don't have a good understanding of what this characteristic is. All we know, for now, is that we've somehow stumbled upon a near-perfect set of heuristics that work amazingly well on the SAT instances we encounter in practice.
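For anyone who hasn't played with one, the modern interfaces are delightfully simple; a sketch using the python-sat package ("pip install python-sat"; literals follow the DIMACS convention, positive = variable, negative = its negation):

    from pysat.solvers import Glucose3

    # (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
    solver = Glucose3()
    solver.add_clause([1, -2])
    solver.add_clause([2, 3])
    solver.add_clause([-1, -3])

    if solver.solve():
        print("SAT:", solver.get_model())  # a satisfying assignment, e.g. [-1, -2, 3]
    else:
        print("UNSAT")
    solver.delete()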
There's also the intriguing possibility that "natural instances" however one might define them arise through some complicated process that ends up sampling from easy SAT instances. I think of it as analogous to the Benford's law (see: https://en.wikipedia.org/wiki/Benford%27s_law#Explanations) except that we do not have nearly as good explanations for SAT solvers.
It is also highly likely that MOST instances of SAT are easy but that doesn't preclude it from being NP hard, nor does it stop us from easily constructing hard instances. (Such things are absolutely necessary for cryptography as an example.)
My feeling is there's a bit of each; the problem space has a bit of unknowable quasi-structure, and also tends to miss the perverse corner cases and counterexamples that we like to think about as mathematicians.
If you want something more "high-level", let me know and I'll look for something appropriate.
Which almost immediately suggests a solution to why NN learning works: the processes that produce the types of datasets humans are interested in are produced by (effectively) polysized networks and you can probably say things like the probability of recovering a polysize function from polysize samples is high.
OTOH, the parent comment suggested we should draw our attention to the processes that produce the data and find correspondences to how NNs decompose it.
The famous AlexNet that blew away the ImageNet competition in 2012 contained 8 layers; more recent networks have even more.
But you're making it sound like shallow networks are a thing of the past. I would compare this to the field of NLP, where it seems we don't have a good general idea what to do with deep networks, and the things that work so far are mostly shallow.
word2vec is one layer plus a softmax. GloVe is similar to a one-layer network. char-rnn seems to do as well with two layers as it does with three. All the gradient-descent classifiers out there are equivalent to one-layer networks.
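To see how little architecture that is, here's a minimal numpy sketch of the skip-gram shape (toy dimensions; real word2vec replaces the full softmax with negative sampling or a hierarchical softmax for speed):

    import numpy as np

    V, D = 1000, 50                       # vocabulary size, embedding dimension
    W_in = np.random.randn(V, D) * 0.01   # the "one layer": word embeddings
    W_out = np.random.randn(D, V) * 0.01  # output projection before the softmax

    def predict_context(word_id):
        h = W_in[word_id]                  # hidden representation of the word
        logits = h @ W_out                 # a score for every vocabulary word
        p = np.exp(logits - logits.max())  # numerically stable softmax
        return p / p.sum()                 # distribution over context words

    probs = predict_context(42)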
Another bit that stood out for me:
> You’re awarded an extremely generous grant that allows you to give 200,000 people a 500-question personality test, with answers that vary on a scale from one to 10.
Dating sites like OkCupid should have such data.
I guess more people feel that way, and I guess that is a good thing, otherwise everybody would now be working on AI, since it is such a promising field.
If we are depending on AI systems to make important decisions, possibly life-or-death decisions, shouldn't we understand how they come to the conclusions that they do?
I cannot deny the tremendous advances in neural networks recently, but I find Sussman's point compelling.
maybe it's the combination of sheer computing power and availability of data that allows better models. maybe we've never looked at algorithmically generated models because we also want a narrative (commonsensical explanation) to the model, not just a matter of algorithmically finding correlations between jelly beans and acne, say.
part of me thinks, ok the machines found something. now can we actually use that to understand the world, rather than build more recommendation algorithms? (haha). I'm not sure what an advisor will say about a doc student who says, let's just throw reams of data at a machine until we find a meaningful correlation, and then let's reason from the correlations (my guess is 'no', that's not really the scientific method is it).
I'm hopeful, and I don't think the answer to the question will come from academia.
Currently they are saying yes, please, and desperately trying to source ever bigger volumes of data.
Norvig et al.'s article "The unreasonable effectiveness of data" has been cited over 400 times, which is a lot for something with no formula.
The word you want is "interpretable". Powerful blackbox models have been known for a while, but some industries prefer interpretable models like linear models and decision trees; sometimes you're even forbidden from using others due to regulation.
I'm reminded of the time Google Translate autodetected "Gesundheit" as Spanish. And Gmail kindly offering to translate "hahaha" from Portuguese, putting an ad for coconuts next to it.
Data science is improving, but you might be surprised how slowly. Especially in the consumer space, because the metrics on effectiveness are so warped.
Voice recognition of numbers only, over a phone connection, can be below 40% accuracy! Much of the perceived success of these systems comes not from the core machine algorithm, but from clever human tweaks around it. Also end-users who are happy with what they get, not quite realizing how goofy it all is if they were to get a glimpse of the raw data.
We have machines that can categorise pictures better than humans. In 2011 that seemed completely impossible.
A Google Image search for "Wonder Wheel" (the famous Coney Island Ferris Wheel) shows this spoked diagram within the first page of results:
Also this year, Google Photos classified black people as gorillas.
Consumers are rarely exposed to the raw machine output -- for good reason. My experience building these sorts of systems is that they're pretty goofy and they fail unexpectedly. After chasing audio and video problems using custom software as well as three major toolkits, I find myself hyper-aware of the flaws in public systems.
Also common sense dictates that it's more about the data scientist on the way in and the UX person on the way out than the machine.
The first result for me is https://upload.wikimedia.org/wikipedia/commons/2/27/Wonder_W...
Looks good to me?
As for the gorilla incident, I don't think anyone is claiming that errors don't occur, and it's very fair to say that particular error was very embarrassing for Google. It's interesting how children make the same kind of embarrassing mistakes, eg: https://www.reddit.com/r/Parenting/comments/24me24/embarrass...
That's not true; age and wisdom are quite related, just not directly causal. They are correlated. Older people are generally wiser, it's just not a guarantee, nor is it impossible for young people to be wise, but it's certainly far less common. With age comes experience and with experience, wisdom has fertile ground to grow, though it doesn't always.
"neither general nor personal wisdom have a positive linear relationship to age... age is not only not related to personal wisdom (as is the case for general wisdom) but even negatively related..."
[Mickler & Staudinger (2008), Sneed & Whitbourne (2003)]
Source: The Scientific Study of Personal Wisdom: From Contemplative Traditions to Neuroscience
The gorilla example is more to the point. Google pushed a quick fix and couldn't fix it. So they blocked the gorilla tag altogether. Why? Because a data scientist couldn't figure out how to fix the problem properly and deploy a solution. To me, that says quite a bit about the current realities of machine learning systems.
1) An AI researcher decides that problem X is hard enough that solving it represents intelligence
2) Researcher develops an algorithm to solve X
3) Now that we have that algorithm, it's no longer an AI problem
When you teach a computer to play chess, it's just depth-first search with a few heuristics added in. When you write an expert system to evaluate mortgage loans, it's just following an if-then script.
Machine learning techniques, though, seem to have stayed solidly in the AI camp.
It sounds to me like this 'ink drop' is a metaphor to explain some state-of-the-art dimensionality reduction technique. Does anyone know the common name of this technique?
It isn't really all that magical or mysterious.
In the Ilya Sutskever Talking Machines podcast he describes exactly how it is "magic". He talks about how there is no theoretical basis to think that a deep neural network should converge, and prior to around 2006 the accepted wisdom was that networks deep enough to outperform other methods of machine learning were useless because they couldn't be trained.
And then they discovered how to initialize the random starting values in such a way that they do converge - for reasons no one really understands.
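One well-known scheme of that kind is Glorot/Xavier initialization (I'm not claiming it's the one Sutskever means); a minimal sketch with my own toy numbers of why scaling weights by layer width matters when you stack many layers:

    import numpy as np

    def forward(x, widths, scale):
        for n_in, n_out in zip(widths, widths[1:]):
            W = np.random.randn(n_in, n_out) * scale(n_in, n_out)
            x = np.tanh(x @ W)                 # activations after each layer
        return x

    x = np.random.randn(64, 500)
    widths = [500] * 20                        # a 20-layer stack
    naive = forward(x, widths, lambda i, o: 1.0)                     # unscaled
    xavier = forward(x, widths, lambda i, o: np.sqrt(2.0 / (i + o)))
    # unscaled gaussians saturate tanh immediately (|activation| ~ 1 everywhere);
    # the scaled version keeps activations in a healthy range layer after layer
    print(np.abs(naive).mean(), np.abs(xavier).mean())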
I've heard a lot of people claim there is no magic, but I tend to think Sutskever knows a little bit more than most people about deep neural networks.
This "magic" of converging networks reminds me of how ensemble methods, such as random forests, are effective but people aren't sure why. There's certainly (AFAIK) no theoretical grounds to say that a bunch of random decision trees should yield universally good results.
MNIST is a handwritten digit database. Each 784-pixel image (28x28) corresponds to a digit from 0 to 9. As a pure mathematical construction, treating each pixel as binary, there are at most 2^784 possible inputs, and only 10 possible outputs.
So if you have 784 completely different ways of analyzing the image, and you combine them in the right ways, you will get roughly an approximation of an answer. This is a tautology if the 784 ways are "the value of each pixel" and the combinations are "magic", but if you have more "intelligent" combinations you should have combinations that are less magic. And in this case, since humans can generally determine the digit value from a 7-light display, it seems reasonable that there exists some way to have "intelligent" combinations such that they combine to form a neural network that solves the problem of digit identification.
And that (still hand-wavy) explanation can also plausibly describe how a human would describe identifying a number. If I ask you "why is this a 1 and not a 3", you might say "because it's straight" or "because it's narrow" or "because it doesn't have a point in the middle" or any number of other descriptions of the object. So you can envision a 2-layer network where the middle layer calculates this (and due to the structure of images, in practice it might be better as a 3- or 4-layer network, but the important point is that the search algorithms don't rely on it or you knowing what these middle layers are ahead of time).
Which only leaves the question of how "neural network learning" is supposed to find this. And there are a few heuristics which combine to (in practice) be a very effective search. We have back-propagation (which is much easier with automatic differentiation), so we can adjust the entirety of the network based on the output (and it's an axiom that if you have a lot of things, they will be similar in the ways they are the same, and different [hopefully in some regular way] in the ways they are not the same). We have dropout, where we randomly disable connections so the network can't lean on irrelevant ones. We can add new connections to see if they are relevant. We can do any number of hill-climbing algorithms on the output of the fitness function. And, as a valid search algorithm, it tends to converge to a valid result.
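To make the hand-waving slightly more concrete, here's the entire structure of a two-layer digit classifier plus backprop in numpy (random data standing in for MNIST; a sketch of the shape, not a tuned implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((256, 784))                # fake 28x28 "images"
    Y = np.eye(10)[rng.integers(0, 10, 256)]  # one-hot fake digit labels

    W1 = rng.normal(0, 0.01, (784, 128))      # input -> hidden combinations
    W2 = rng.normal(0, 0.01, (128, 10))       # hidden -> digit scores
    lr = 0.5

    for step in range(200):
        H = np.maximum(0, X @ W1)                        # hidden layer (ReLU)
        logits = H @ W2
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)                # softmax over 10 digits
        dlogits = (P - Y) / len(X)    # gradient of cross-entropy w.r.t. logits
        dW2 = H.T @ dlogits
        dH = dlogits @ W2.T
        dH[H <= 0] = 0                                   # ReLU gradient
        W1 -= lr * (X.T @ dH)                            # gradient descent step
        W2 -= lr * dW2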
Obviously none of this is at all rigorous, but if you know the math here enough I don't think you're asking the questions in this article.
Interesting intuitions; but if by "ways of analyzing the image" you mean functions, there are an awful lot more than 784...
Maybe start with a minimal 'bootstrap protozoa' implementation that can evolve into highly complicated forms.
Then I realised that, since machine learning processes take a long time, that would take forever, making auto-evolving computer life forms a long way away.
Although if it were possible, then in the same way we became sentient based on very simple inputs, computers with a sufficiently complex process could too...
it might not be rigorous, but for standard, deep neural networks it's intuitively obvious enough that you can reinvent the idea from scratch in your bedroom in a time before the internet just by getting a vague description of the idea and thinking about how it /could/ work.
proving convergence may be difficult, but it's not particularly challenging to see why it happens imo. :/
a lot of the things the article points at are utterly irrelevant to the subject. e.g. sigmoid functions, depth of networks.
i think there are some bold baseless claims here - for instance, linearly interpolating data points is pretty simple as a way to approximate a function, and it's not hard to see how neurons can provide this with linear activation functions and some very naive back propagation. (e.g. evenly dividing error and correcting weights towards the correct result)
if this didn't converge /that/ would be surprising.
Each layer of neuron activations is a new representation of the data - produced by the weighted connections.
Function compositions of linear functions are still only a linear function.
Each neuron in a layer sums its weighted inputs and then applies a non-linear activation; it is this non-linearity that makes composing layers - function composition - add anything.
This is famously expressed in Minsky and Papert's 1969 'Perceptrons': a single layer of network weights is incapable of learning XOR.
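The counterexample made concrete: one hidden layer expresses XOR exactly as "at least one input on, but not both" (hand-picked weights for illustration, not learned ones):

    def step(z):                               # threshold activation
        return 1.0 if z > 0 else 0.0

    def xor_net(a, b):
        h1 = step(a + b - 0.5)                 # hidden unit: a OR b
        h2 = step(a + b - 1.5)                 # hidden unit: a AND b
        return step(h1 - h2 - 0.5)             # output: OR but not AND

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, int(xor_net(a, b)))        # prints 0, 1, 1, 0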
One analysis is that Neural nets transform the shape of the data until a single line on a twisty high dimensional manifold produces the desired distinction.
A network with a single hidden layer is a universal approximator, and a net can be trained or distilled from another net - but deep nets are overwhelmingly better at the initial learning and discovery.
Neural nets have been related to Kadanoff's spin glasses, suggesting learning is akin to phase transitions or sand pile collapse, where small local changes can produce profound global changes.
Generally when training nets, the net initially learns a lot very quickly; most of the error vanishes at the start.
Word2Vec demonstrates that nets learn very powerful representations; the fact that word2vec vector algebra is semantic points to representations far richer than expected.
Similar semantic vector math can be performed on images and the same vectors can translate modalities, e.g. text to images.
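What "vector algebra is semantic" looks like in practice (a sketch using the gensim library; I'm assuming the "glove-wiki-gigaword-100" pretrained model that gensim's downloader distributes):

    import gensim.downloader

    vecs = gensim.downloader.load("glove-wiki-gigaword-100")  # pretrained vectors
    # king - man + woman lands near queen, using nothing but the learned vectors:
    print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))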
Natural Evolution produces efficient solutions.
I propose the successes of deep learning so far are partially explicable because they are working within human culture and perception, relearning our very efficient solutions - akin to distilling a deepnet into a shallow one.
This hypothesis will be tested if embodied deep neural nets using reinforcement learning discover their own efficient solutions to performing tasks in the real world - robotics.
IMHO Pieter Abbeel's deep net, learning to robustly output robot motor torques directly from camera pixels, will show if embodied deep nets can do discovery rather than relearning what we know. http://www.cs.berkeley.edu/~pabbeel/research_rll.html
Similarly, Bayesian statistics (a.k.a. probabilistic programming, graphical models, Bayes networks etc.) has a rock solid mathematical foundation – both the statistics behind it and the sampling algorithms (MCMC).
Of course, many mathematical statistical methods started off as hacks too. The invention of the L2 loss function, I imagine, must've gone a little like "alright, let's try to find the pattern here, let's draw the line through this series of points that minimizes the sum of the distance of each point to the line... hm, I don't like how this looks, let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit more... oh, that looks better."
We later realized this worked so well because it's maximum likelihood for the model y = ax + b + e, where e is Gaussian noise. But that came much later.
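For the record, the one-line version of that realization (the standard derivation, nothing beyond what the comment states):

    % With y_i = a x_i + b + e_i and e_i ~ N(0, sigma^2) i.i.d.:
    \[
    \begin{aligned}
    L(a,b) &= \prod_i \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left(-\frac{(y_i - a x_i - b)^2}{2\sigma^2}\right) \\
    \log L(a,b) &= \mathrm{const} - \frac{1}{2\sigma^2}\sum_i (y_i - a x_i - b)^2
    \end{aligned}
    \]
    % so maximizing the likelihood is exactly minimizing the sum of squared
    % residuals, i.e. the L2 loss.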
Then again, on the other hand, "calculus works nicely" and "it's simple" probably are good spooky directors toward useful models anyway.
|y - wx| isn't differentiable but it is subdifferentiable. So the derivative is defined at every point except where y = wx, and in that case you're more or less fine if you just pretend that the derivative is zero at that point.
One reason why squaring is preferred is that the derivative has a nice closed form, which can be used to find closed form solutions.
Another is that people like to fit their models with gradient descent, and gradient descent has better guarantees for strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, you have the largest gradients when you're far from the minimum. If you try to minimize |x|, then your gradient is always either +1 or -1, making it harder to converge around the minimum. See the Huber loss for a smoothed relaxation of the absolute value function.
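A toy run of the three losses mentioned (my own throwaway example): fixed-step gradient descent converges on x^2 and on Huber, but bounces around the minimum of |x|:

    def gd(grad, x0=5.0, lr=0.4, steps=30):
        x = x0
        for _ in range(steps):
            x -= lr * grad(x)                 # plain fixed-step gradient descent
        return x

    # gradient of the Huber loss: linear tails, quadratic bowl near zero
    huber_grad = lambda x, d=1.0: x if abs(x) <= d else (d if x > 0 else -d)

    print("x^2  :", gd(lambda x: 2 * x))               # -> essentially 0
    print("|x|  :", gd(lambda x: 1 if x > 0 else -1))  # stuck bouncing at +-0.2
    print("huber:", gd(huber_grad))                    # -> essentially 0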
In 'machine learning', statistical learning theory by Vapnik and co (the theory behind SVMs) has a beautiful and strong mathematical underpinning. But in general, it is true we don't really have a good understanding of why learning algorithms work. The very notion of generalization to unobserved data is not well understood (I like D. Wolpert's papers on that topic).
It has been a while since I read those papers; there may be better references.
Gal and Ghahramani "We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process" http://arxiv.org/pdf/1506.02142.pdf
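The practical upshot of that equivalence is "MC dropout": keep dropout switched on at test time and average many stochastic forward passes to get an approximate posterior predictive mean plus an uncertainty estimate. A minimal numpy sketch with made-up weights:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(10, 64)), rng.normal(size=(64, 1))

    def forward(x, p_drop=0.5):
        h = np.maximum(0, x @ W1)                 # hidden layer (ReLU)
        mask = rng.random(h.shape) > p_drop       # dropout stays ON at test time
        return (h * mask / (1 - p_drop)) @ W2

    x = rng.normal(size=(1, 10))
    samples = np.array([forward(x) for _ in range(100)])
    print("prediction:", samples.mean(), "+/-", samples.std())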
There have been many hierarchical networks in Bayesian statistics. Hierarchical Dirichlet Process, hierarchical beta process, Pitman-Yor process, Gaussian process, nested, dependent, translated, generalized versions of the Chinese Restaurant process, or the Indian Buffet one. The deep part isn't new.
People seem to talk about auto-encoders a lot. Its very definition introduces hidden variables about which very little is defined. It corresponds with some kind of prior on what representation should be used by the network. This is studied in Bayesian compressive sensing. http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
4.) Layered training.
Layers are trained one by one. Doesn't really ring a bell. Maybe it corresponds to some MCMC method I'm not aware of. There are often inner and outer loops in MCMC and I guess something fundamental can be told about the right moments to switch from inner to outer. However, that does not correspond to learning low-level features first.
More like: so we can differentiate easily. Then if anyone asks, you wave your hands and mumble something along these lines :)