Alas, the proliferation over the past few years of different kinds of neural net architectures that work is a sign that we lack a decent unifying theoretical framework that can explain, from fundamental principles, what works, what doesn't, and why. We're not there yet.
Having different algorithms for different purposes is fine. For instance, autoencoders can do unsupervised learning, GANs can learn generative models that sample from the entire distribution, recurrent neural networks can handle time series, etc. Also, while there are many different types of networks, research has shown which ones work and which ones don't. Few people pretrain autoencoders or bother with RBMs anymore, for instance, and I think we have good theoretical reasons why they aren't as good.
But to continue the analogy to computer science, imagine all the different kinds of sorting algorithms. Each will perform better or worse depending on how the data you're sorting is arranged. If it's already sorted in reverse, that's a lot different than if it's arranged completely randomly, or if it was sorted and then big chunks were randomly rearranged.
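To make that concrete, here's a minimal sketch (plain Python; the input sizes and arrangements are just illustrative) showing how one algorithm's running time swings with input arrangement:

    import random
    import timeit

    def insertion_sort(xs):
        # O(n) on already-sorted input, O(n^2) on reverse-sorted input.
        xs = list(xs)
        for i in range(1, len(xs)):
            j = i
            while j > 0 and xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
        return xs

    n = 2000
    arrangements = {
        "sorted": list(range(n)),
        "reversed": list(range(n, 0, -1)),
        "random": random.sample(range(n), n),
    }
    for name, data in arrangements.items():
        t = timeit.timeit(lambda: insertion_sort(data), number=3)
        print(f"{name:>8}: {t:.3f}s")

Insertion sort breezes through the already-sorted case in a single pass and grinds through the reversed case, with the random case somewhere in between.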
There's no way to prove that one sorting algorithm will always do better than another, because there are always special cases where the others do better. The same is true of neural networks: it's impossible to formally prove they will work, because it depends on how real-world problems are distributed.
I highly recommend this book to anyone who has access to a library and is interested in seeing algorithms from the thousand-mile-high point of view: https://www.amazon.com/Algebra-Programming-Prentice-Hall-Int...
I suppose the point of my comment is that theoretical computer science is actually a field with a lot of unifying theories that approach computation in coherent ways. Applied computer science is much, much messier because it is interested in the particularities and flaws of real world computational models and getting practical results now, leaving explanations to come later.
There are unifying theories of inference for AI, but they don't really cover deep neural networks. There are a few tantalizing hints that deep learning is intimately related to profound concepts in physics (renormalization) and functional programming.
We can even prove that there is no optimal sorting algorithm. "For every sorting algorithm x there exists an input i for which a sorting algorithm y exists, such that y sorts i faster than x" is provable: given any x, pick a worst-case input i for x, and let y be an algorithm that recognizes i and emits the precomputed answer directly.
Which is a shame. I think there are unifying frameworks in CS, but we could use more, and we would benefit if, when people looked at a body of knowledge and saw just a list of techniques, they said "hey, this could benefit from a more unified treatment".
I normally hate when this happens, so I have to apologise.
Comparing the input against pre-defined lists with a generic fallback algorithm is a sorting algorithm, but as you say it's rather 'trivial'.
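For concreteness, here's a sketch of that 'trivial' construction (the pre-selected inputs are arbitrary examples I made up):

    # A valid sorting algorithm: instant answers for pre-selected inputs,
    # generic fallback for everything else.
    PRECOMPUTED = {
        (3, 1, 2): [1, 2, 3],
        (9, 7, 8, 6): [6, 7, 8, 9],
    }

    def lookup_sort(xs):
        key = tuple(xs)
        if key in PRECOMPUTED:       # the special-cased inputs
            return list(PRECOMPUTED[key])
        return sorted(xs)            # generic fallback

    print(lookup_sort([3, 1, 2]))    # hits the table
    print(lookup_sort([5, 4, 1]))    # falls back to the generic sort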
However, I read klodolph's use of "permutation" as being more subtle than the above. Rather than splitting inputs into "known" (use pre-sorted answer) and "unknown" (use generic fallback), instead you write a valid sorting algorithm which treats all inputs in the same way, but you arrange all of the possible choice-points to perform optimally for your pre-selected inputs.
For example, imagine a quicksort which chooses exactly the right "random" pivots for particular inputs. It's not hard-coded in a particularly 'trivial' way; the fact that it works optimally for those inputs is a (carefully engineered) 'coincidence'.
In fact, some quicksort implementations randomly shuffle the input first, to lower the probability of malicious input triggering the worst-case O(n^2) behaviour. What if such a shuffle just so happened to "cancel out" the disorder of particular inputs, causing them to become sorted before the quicksort even begins? It's debatable whether that would count as a 'coincidence' or as 'checking if the input matches a given list'.
We could even avoid the O(n log n) cost of average-case quicksort if we used that shuffling trick with bubble sort, to get its O(n) best-case performance.
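Here's a toy sketch of that engineered 'coincidence', assuming we get to precompute the permutation offline for one particular input (everything here is invented for illustration):

    def bubble_sort(xs):
        xs = list(xs)
        for end in range(len(xs), 1, -1):
            swapped = False
            for j in range(end - 1):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
                    swapped = True
            if not swapped:          # best case: one O(n) pass, no swaps
                break
        return xs

    target = [4, 1, 3, 2]            # the input we engineer the shuffle for
    # The sorting permutation of `target`, precomputed offline.
    perm = sorted(range(len(target)), key=lambda i: target[i])

    def engineered_shuffle(xs):
        # Looks like just another fixed permutation; happens to sort `target`.
        return [xs[i] for i in perm]

    print(bubble_sort(engineered_shuffle(target)))   # one pass: [1, 2, 3, 4]

Of course, the same shuffle scrambles almost every other input, which is exactly the no-free-lunch trade-off being discussed.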
But that's not my point. My point is that CS people can actually formally prove things about their algorithms, because they can make reasonable assumptions like "the data is randomly distributed". Machine learning can't make strong assumptions like that, so it can't formally prove much of anything. Expecting anything like formal proofs for machine learning algorithms is unreasonable because of the no-free-lunch theorem.
The same is true of sorting algorithms if you remove the assumption that the data is randomly distributed. If you don't know how the data in a real-world problem is distributed, then there is no way of proving which algorithm will do better. You just have to test them empirically.
Maybe that's because reality is high-dimensional but our brains can only classify interesting things on or near a lower-dimensional manifold.
As an example, merge sort establishes an invariant in its divide step which is exploited in the conquer step. Moving this idea forward, comparison-based sorting in general establishes some version of this invariant. This argument seems to make sense to me, though.
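A quick sketch of what I mean, with the invariant called out in the comments:

    def merge_sort(xs):
        if len(xs) <= 1:
            return list(xs)           # base case: trivially sorted
        mid = len(xs) // 2
        left = merge_sort(xs[:mid])   # invariant: left half is sorted
        right = merge_sort(xs[mid:])  # invariant: right half is sorted
        return merge(left, right)     # the conquer step exploits it

    def merge(left, right):
        # Correct only because both inputs are sorted: the smaller head
        # element is always the global minimum of what remains.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out

    print(merge_sort([5, 2, 8, 1, 9, 3]))   # [1, 2, 3, 5, 8, 9]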
Yea, I think that's probably true as well. Do you consider the von Neumann architecture to be the unifying theoretical framework? If not, then what?
Interesting side note: we are currently unable to prove that there is nothing more powerful than a Turing machine, and I don't know if it's even provable (that smells like something undecidable).
Since this is currently my seminar research topic: in Zermelo–Fraenkel set theory there is only one Urelement.
I can prove the non-existence of another Urelement not equal to that Urelement.
Since I only care about Zermelo–Fraenkel, I can define Urelemente as things inside Zermelo–Fraenkel set theory that are not sets themselves but are elements of a set. There is only one element that satisfies this definition. Everything else is not an Urelement, since it does not satisfy the definition.
I don't understand the proof.
If A is an Urelement, and B is an Urelement, ... I don't see a contradiction in A is an element of C, and B is not an element of C?
I mean, I defer to your research, but I do not understand...
I'm not sure I understand.
I would think that disproving the nonexistence of something would be the same as demonstrating the existence of something. That seems doable, in a sense at least.
And I would think that disproving the existence of something could be done by deriving a contradiction from the assumption that the thing exists. This also seems possible.
I'm guessing I am misunderstanding you in some way. Can you help me understand?
Could you elaborate on this?
I ask because it's not obvious to me that this is true. Physics, for instance, has had proliferations of new "particles" both because physicists lacked a solid understanding of a phenomenon (but knew how to tweak the setup to vary the result) and because they had a new, solid theoretical framework that was able to accurately predict whole classes of new particles. (There's a bunch of unstable particles related to the electroweak interaction that came from this, for instance.)
As someone who has reached "journeyman" at the topic, it's sort of hard to tell if people are just tweaking the setup to get new experimental results they can use to brute force theory building or whether they're setting out on a specific research path on solid theoretical grounds. (I expect that it's a mix of the two, with different shops being different blends, to be honest.)
Theoretically, we should be able to look at a bunch of different geometries and explain what works and what doesn't, just from the original rules and the geometry. The fact that we can't, and usually experimentation is necessary to get a good idea about the strengths and weaknesses of different geometries, is a signal that we don't really understand the system. I'm not aware of any theory of the type: give me an ordered graph with typed nodes and I'll give you the types of data this is good at modeling, how quickly it will converge, if you need dropout or layer-skip or something else in order to make it converge well, etc.
In your physics analogy: we can write out various properties of composite particles purely from theory, e.g. mesons' decay pathways and allowed compositions, baryons' decay pathways and allowed compositions, etc., just from knowing the core 17 Standard Model particles. We know about a few core nodes of neural nets, but can't use that to predict the behavior of the whole net.
I mean, if physics publishes papers when they find things they expected to find (at least, the first instance of each kind of thing and thereafter, any novel improvement in their production), why wouldn't machine learning theorists?
That's precisely what I can't tell: is a paper that's "We found that architecture X performed task Y with score Z" the same as "We found particle X at energy level Y and have no idea what it is" or "We found particle X at energy level Y just as we were expecting"?
And a lot of the papers are "We changed architecture X to now have feature Y and got the expected improvement Z", for which I doubly can't tell how expected the improvement was and how systematically improvements are being designed and implemented.
Right now, we're in the early stages of engineering, like architecture before modern physics: we know some things that work, and we have some good intuitions about why, but there's little solid foundation to tell us how to proceed. We take some loose inspiration from nature (replace tanh functions with rectifiers to mimic action potentials in the brain, build convolutional networks similar to the retina) and find that it's more effective, sometimes, and sometimes less. We also just try a lot of stuff in hopes that something will stick. It's not as if there are some real, true neural networks out there, waiting like particles to be discovered: everything in the neural network zoo was built by hand, maybe inspired by nature, and saved because it works well, or is at least interesting; other architectures are forgotten. What we'd like is engineering principles that we can understand, so trying to make a neural network better at function x is just a matter of adding more units here or editing a function there, not venturing out into the dark again. (Such a reductive set of explanations may not exist for cognition, which really worries people who liked computers for their predictability.)
I suppose the alternative, more scientific/less engineering approach would be to treat benchmark scores as experimental observations, and try to form predictive models which take in descriptions of networks and output predicted benchmark scores. In the architecture analogy, this would be like modelling the strength of various materials and shapes. If good predictive models are found, they can be used to design networks which are predicted to have desirable scores, in the same way that buildings can be designed based on predictions of how the materials and geometry will behave.
Of course, to be more useful we'd also want to take into account things like resource usage, training time, etc. and the models themselves must be constrained somehow, to avoid trivial solutions like "run the given network and see how it behaves, give that as our prediction".
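As a toy sketch of what such a predictive model might look like (the features, scores, and candidate design below are entirely made up; a real attempt would need far richer network descriptions and many more observations):

    import numpy as np

    # Hypothetical observations: (depth, width, log10 param count) -> score
    X = np.array([[ 4,  64, 5.1],
                  [ 8, 128, 6.0],
                  [16, 256, 7.2],
                  [32, 256, 7.8]])
    y = np.array([0.71, 0.78, 0.84, 0.86])    # invented benchmark scores

    # Least-squares linear model with a bias term.
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    candidate = np.array([24, 256, 7.5, 1.0])  # a design we're considering
    print("predicted score:", candidate @ coef)

Note that a model like this never executes the candidate network, which is one crude way of ruling out the trivial 'run it and report what you see' solution.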
"Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum."
But it's not feasible yet. Once it is, you can imagine collapsing the whole neural net zoo: you merely specify the input/output type and dimension, and then it starts gradient ascent over all the possible models as tweaked by internal hyperparameters.
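To see why it's possible in principle, here's a toy version of the idea (the paper chains derivatives backwards through training; this 1-D example accumulates them forward, which gives the same exact gradient for a single hyperparameter, and all the numbers are made up):

    # Train w on loss 0.5*(w - a)^2 with SGD, then differentiate the
    # validation loss 0.5*(w - b)^2 with respect to the learning rate by
    # chaining derivatives through every update.
    def train_and_hypergrad(lr, w0=0.0, a=3.0, b=2.5, steps=50):
        w, dw_dlr = w0, 0.0
        for _ in range(steps):
            g = w - a                    # gradient of the training loss
            # d/dlr of the update w' = w - lr*g, chained through prior steps
            dw_dlr = dw_dlr - g - lr * dw_dlr
            w = w - lr * g
        val_loss = 0.5 * (w - b) ** 2
        dval_dlr = (w - b) * dw_dlr      # chain rule through the val loss
        return val_loss, dval_dlr

    loss, hypergrad = train_and_hypergrad(lr=0.1)

    # Sanity check against finite differences.
    eps = 1e-6
    lp, _ = train_and_hypergrad(lr=0.1 + eps)
    lm, _ = train_and_hypergrad(lr=0.1 - eps)
    print(hypergrad, (lp - lm) / (2 * eps))   # should agree closely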
Hell, ANNs themselves were widely the last choice for ML problems until a couple of years back, and now they seem like the go-to first choice.