The Neural Network Zoo (asimovinstitute.org)
358 points by hgarg on Oct 20, 2016 | 39 comments

Good job.

Alas, the proliferation of different kinds of neural net architectures that work, over the past few years, is a sign that we lack a decent unifying theoretical framework that can explain, from fundamental principles, what works, what doesn't, and why. We're not there yet.

Look at computer science. They have invented so many different algorithms that work! This is a sign that computer science lacks a decent unifying theoretical framework.

Having different algorithms for different purposes is fine. For instance, autoencoders can do unsupervised learning, GANs can learn generative models that sample from the entire distribution, recurrent neural networks can handle time series, etc. Also while there are many different types of networks, research has shown which ones work and which ones don't. Few people pretrain autoencoders or bother with RBMs anymore, for instance. And I think we have good theoretical reasons why they aren't as good.

But to continue the analogy to computer science, imagine all the different kinds of sorting algorithms. They will each work better or worse based on how the data you are sorting is arranged. If it's already sorted in reverse, that's a lot different than if it's sorted completely randomly, or if it was sorted and then big chunks were randomly rearranged.

There's no way to prove that one sorting algorithm will always do better than another, because there are always special cases where each does better. The same is true of neural networks: it's impossible to formally prove they will work, because it depends on how real-world problems are distributed.
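The dependence on input arrangement is easy to see empirically with Python's built-in sort (Timsort, which is adaptive). This sketch counts comparisons rather than wall-clock time; the helper name is my own:

```python
import random
from functools import cmp_to_key

def comparisons_used(data):
    """Count comparisons made by Python's built-in Timsort on this input."""
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(data, key=cmp_to_key(cmp))
    return count

n = 10_000
print("already sorted:", comparisons_used(list(range(n))))         # one run: ~n comparisons
print("reverse sorted:", comparisons_used(list(range(n, 0, -1))))  # one descending run: ~n
print("random order:  ", comparisons_used(random.sample(range(n), n)))  # ~n log n
```

Already-sorted and reverse-sorted inputs are detected as single runs and cost roughly n comparisons; a random permutation costs roughly n log n.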

Type theory is a unifying theory of program semantics. It is one among many, each of which is a facet of one whole. In particular, the theory of algebraic and coalgebraic data types is a unifying theory for computation that allows one to derive efficient algorithms automatically. It does not encompass all algorithm design by a long shot (it is actually quite niche), but as a coherent mathematical theory that gets to the essence of computation, it is very promising.

I highly recommend this book to anyone who has access to a library and is interested in seeing algorithms from the thousand-mile-high point of view: https://www.amazon.com/Algebra-Programming-Prentice-Hall-Int...

I suppose the point of my comment is that theoretical computer science is actually a field with a lot of unifying theories that approach computation in coherent ways. Applied computer science is much, much messier because it is interested in the particularities and flaws of real world computational models and getting practical results now, leaving explanations to come later.

There are unifying theories of inference for AI, but they don't really cover deep neural networks. There are a few tantalizing hints that deep learning is intimately related to profound concepts in physics (renormalization) and functional programming.

I don't think that this is a sign that computer science lacks a decent unifying theoretical framework.

We can even prove that there is no optimal sorting algorithm: "For every sorting algorithm x, there exists an input i and a sorting algorithm y such that y sorts i faster than x" is provable.

I think the GP was being sarcastic and casting doubt on the need for or importance of a unifying theoretical framework in CS.

Which is a shame. I think there are unifying frameworks in CS but think we could use more and I think we would benefit if, when people looked at a body of knowledge and saw just a list of techniques, they would say "hey, this could benefit from a more unified treatment".

Well, that's a typical case of editing without reading the answer I was referring to. I read Andrew's answer and then changed my comment without re-reading the answer I was replying to.

I normally hate it when this happens, so I have to apologise.

Well, yes, but that's not terribly illuminating because your algorithm Y just has to encode the permutation for I and then check that the result is sorted before falling back to another algorithm. So in another sense you can say that this is true for any problem where the NTIME complexity is less than the DTIME complexity, which is an incredibly large set of problems, because you can just encode an oracle into your machine for a chosen input.

Well, a sorting algorithm has to sort arbitrary input, so def sort(ignore) = [1,2,3] does not count. The details are complicated and I am not sure I am ready to understand the proof, but I think you can even prove that there is no optimal algorithm even when you rule out trivial things like checking whether the input matches a given list and returning a pre-sorted one, or else sorting it the usual way.

Clearly, as you say, ignoring the input and producing a single output is not a sorting algorithm.

Comparing the input against pre-defined lists with a generic fallback algorithm is a sorting algorithm, but as you say it's rather 'trivial'.
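A sketch of such a 'trivial' sorter, with a made-up special input (the names and the chosen list are illustrative assumptions):

```python
# A 'trivial' sorter: recognise one pre-selected input instantly,
# fall back to a generic algorithm for everything else.
SPECIAL_INPUT = [3, 1, 4, 1, 5, 9, 2, 6]
SPECIAL_ANSWER = sorted(SPECIAL_INPUT)  # precomputed "offline"

def lookup_sort(xs):
    if xs == SPECIAL_INPUT:           # O(n) equality check
        return list(SPECIAL_ANSWER)   # beats any comparison sort on this one input
    return sorted(xs)                 # generic fallback: still a valid sorter

print(lookup_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # fast path
print(lookup_sort([5, 4, 3]))                 # fallback path
```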

However, I read klodolph's use of "permutation" as being more subtle than the above. Rather than splitting inputs into "known" (use pre-sorted answer) and "unknown" (use generic fallback), instead you write a valid sorting algorithm which treats all inputs in the same way, but you arrange all of the possible choice-points to perform optimally for your pre-selected inputs.

For example, imagine a quicksort which chooses exactly the right "random" pivots for particular inputs. It's not hard-coded in a particularly 'trivial' way; the fact that it works optimally for those inputs is a (carefully engineered) 'coincidence'.

In fact, some quicksort implementations randomly shuffle the input first, to lower the probability of malicious input triggering the worst-case O(n^2) behaviour. What if such a shuffle just so happened to "cancel out" the disorder of particular inputs, causing them to become sorted before the quicksort even begins? It's debatable whether that would count as a 'coincidence' or 'checking if the input matches a given list'.

We could even avoid the O(log n) overheads of average-case quicksort if we used that shuffling trick with bubble sort, to get its O(n) best-case performance.
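Bubble sort's O(n) best case comes from an early-exit check: a pass that makes no swaps means the data is already sorted. A minimal sketch (returning the pass count just to show the best case):

```python
def bubble_sort(xs):
    """Bubble sort with an early-exit flag: a single O(n) pass
    suffices when the input is already sorted."""
    xs = list(xs)
    passes = 0
    while True:
        passes += 1
        swapped = False
        for i in range(len(xs) - 1):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
                swapped = True
        if not swapped:
            return xs, passes

result, p = bubble_sort([1, 2, 3, 4, 5])
print(p)  # 1 pass: linear time on pre-sorted input
```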

Did anyone ever publish an academic paper about a sorting algorithm after merely showing that it beats mergesort's time by several percent on a couple of synthetic datasets?

They certainly should be able to, if they can show their algorithm actually does better on real world datasets.

But that's not my point. My point is that CS people can actually formally prove things about their algorithms. They can make reasonable assumptions like "if the data is randomly distributed". Machine learning can't make strong assumptions like that, and they can't formally prove anything. Expecting anything like formal proofs for machine learning algorithms is unreasonable because of the no free lunch theorem.

The same is true for sorting algorithms, if you remove the assumption that the data is randomly distributed. If you don't know how data is distributed in a real world problem, then there is no way of proving what algorithm will do better. You just have to test them empirically.

Would the references at https://en.wikipedia.org/wiki/Timsort count?

Houshalter: the no-free-lunch theorem assumes that all possible "real world" problems are equally probable. That does not appear to be true in theory (due to the laws of physics, e.g., see https://arxiv.org/abs/1608.08225 ) nor in practice (e.g., in complicated AI-type problems, the objects of interest always seem to lie on/near lower-dimensional manifolds embedded in a high-dimensional input space).

>...the objects of interest always seem to lie on/near lower-dimensional manifolds embedded in a high-dimensional input space.

Maybe that's because reality is high-dimensional but our brains can only classify interesting things on or near a lower-dimensional manifold.

Despite the difficulty of solving tasks humans are good at, we also want machines to solve tasks that are impossible for humans to solve, so I don't think this is really a general argument against no free lunch.

But one could argue that sorting is a fundamental property of the numbers themselves (presumably used as keys), and therefore any alteration to an algorithm is a play on how numbers are ordered?

As an example, merge sort establishes an invariant in its divide step which is exploited in its conquer step. Moving this idea forward, comparison-based sorting establishes some version of this invariant. This argument seems to make sense to me, though.

This is a sign that computer science lacks a decent unifying theoretical framework.

Yeah, I think that's probably true as well. Do you consider the von Neumann architecture to be the unifying theoretical framework? If not, then what?

I don't think there is a lack of a unifying theoretical framework. Both the Turing machine and the lambda calculus are well understood and foundational to computer science (and are equivalent to each other).

I think the problem is more that a gap exists between the theoretical framework and practical programming. For example, have fun trying to formally verify JavaScript code!

Interesting side note: we are currently unable to prove that there is nothing more powerful than a Turing machine; I don't know if it's even provable (that smells like something undecidable).

Philosophical nitpick, it's fundamentally impossible to disprove non-existence of an entity.

That may be the limit of my English understanding, but I can prove the non-existence of certain things (are they entities?).

Since this is currently my seminar-research topic: In the Zermelo–Fraenkel set theory there is only one Urelement.

I can prove the non-existence of another Urelement not equal to the Urelement.

Since I only care about Zermelo–Fraenkel, I can define Urelements as things inside Zermelo–Fraenkel set theory that are not sets themselves but are elements of a set. And there is only one element that satisfies this definition; every other thing is not an Urelement, since it does not satisfy the definition.

I don't understand. Is that the proof?

I don't understand the proof.

If A is an Urelement, and B is an Urelement, ... I don't see a contradiction in "A is an element of C, and B is not an element of C"?

I mean, I defer to your research, but I do not understand...

My point is really about the impossibility of proving a universal negative. Russell's Teapot is instructive here.

You say it's impossible to prove a universal negative? I'd love to see a proof of that


I'm not sure I understand.

I would think that disproving the non-existence of something would be the same as demonstrating the existence of something. That seems doable, in a sense at least.

And I would think that disproving the existence of something could be done by deriving a contradiction from the assumption that the thing exists. This also seems possible.

I'm guessing I am misunderstanding you in some way. Can you help me understand?

Methinks you had a double-negative too many. Your sentence says it's impossible to prove the existence of something. Unless that's what you had intended, in which case I don't see how that's useful philosophy for this discussion.

You're correct; I mis-typed, and it should have been "prove the non-existence".

Aka, "No free lunch theorem".

> the proliferation of different kinds of neural net architectures that work, over the past few years, is a sign that we lack a decent unifying theoretical framework that can explain, from fundamental principles, what works, what doesn't, and why

Could you elaborate on this?

I ask, because it's not obvious to me that this is true. Physics, for instance, has had proliferations of new "particles" both because physicists lacked a solid understanding of a phenomenon (but knew how to tweak the setup to vary the result) and because they had a new, solid theoretical framework that was able to accurately predict whole classes of new particles. (There's a bunch of unstable particles related to the electroweak interaction that come from this, for instance.)

As someone who has reached "journeyman" at the topic, it's sort of hard to tell if people are just tweaking the setup to get new experimental results they can use to brute force theory building or whether they're setting out on a specific research path on solid theoretical grounds. (I expect that it's a mix of the two, with different shops being different blends, to be honest.)

Think about neural nets as emergent: simple nodes give rise to complicated dynamics. The dynamics are determined by the overall geometry of the system.

Theoretically, we should be able to look at a bunch of different geometries and explain what works and what doesn't, just from the original rules and the geometry. The fact that we can't, and usually experimentation is necessary to get a good idea about the strengths and weaknesses of different geometries, is a signal that we don't really understand the system. I'm not aware of any theory of the type: give me an ordered graph with typed nodes and I'll give you the types of data this is good at modeling, how quickly it will converge, if you need dropout or layer-skip or something else in order to make it converge well, etc.

In your physics analogy: we can write out various properties of composite particles purely from theory, e.g. mesons' decay pathways and allowed compositions, baryons' decay pathways and allowed compositions, etc., just from knowing the core 17 Standard Model particles. We know about a few core nodes of neural nets, but can't use that to predict the behavior of the whole net.

Well, a lot of the papers published in the field present results like "We designed a neural net to perform <task> and achieved X% accuracy." The design of the net is novel and interesting enough to merit its own publication. If there were some sort of theoretical framework, results like that would not be interesting, because presumably the theory would explain which NN architectures are good at different tasks and why. I think that we will get there eventually, but right now I don't think we have enough data for patterns to emerge and hint at some sort of Theory.

But couldn't that just be, to use the language of my analogy, papers being published which confirm particle existence?

I mean, if physics publishes papers when they find things they expected to find (at least, the first instance of each kind of thing and thereafter, any novel improvement in their production), why wouldn't machine learning theorists?

That's precisely what I can't tell: is a paper that's "We found that architecture X performed task Y with score Z" the same as "We found particle X at energy level Y and have no idea what it is" or "We found particle X at energy level Y just as we were expecting"?

And a lot of the papers are "We changed architecture X to now have feature Y and got the expected improvement Z", which I doubly can't tell how expected the improvement was and how systemically improvements are being designed and implemented.

I think the issue here is that we're not discovering new architectures as one discovers particles, we're creating them. The creative space is infinite, and people are making subtle tweaks to neural network systems all the time. It's not science, it's engineering.

Right now, we're in the early stages of engineering, like architecture before modern physics: we know some things that work, and we have some good intuitions about why, but there's little solid foundation to tell us how to proceed. We take some loose inspiration from nature (replace tanh functions with rectifiers to mimic action potentials in the brain, build convolutional networks similar to the retina) and find that it's more effective, sometimes, and sometimes less. We also just try a lot of stuff in hopes that something will stick. It's not as if there are some real, true neural networks out there, waiting like particles to be discovered: everything in the neural network zoo was built by hand, maybe inspired by nature, and saved because it works well, or is at least interesting; other architectures are forgotten. What we'd like is engineering principles that we can understand, so trying to make a neural network better at function x is just a matter of adding more units here or editing a function there, not venturing out into the dark again. (Such a reductive set of explanations may not exist for cognition, which really worries people who liked computers for their predictability.)

Another issue with attempting to unify existing results is the focus on good performance, and the higher-level optimisation being performed by the researchers/implementors. This is partly because of the focus on engineering, as you say; I'd wager it's also due to a 'file drawer effect', where the emphasis is on achieving ever-higher benchmark scores, and that rewards tweaking of algorithms.

I suppose the alternative, more scientific/less engineering approach would be to treat benchmark scores as experimental observations, and try to form predictive models which take in descriptions of networks and output predicted benchmark scores. In the architecture analogy, this would be like modelling the strength of various materials and shapes. If good predictive models are found, they can be used to design networks which are predicted to have desirable scores, in the same way that buildings can be designed based on predictions of how the materials and geometry will behave.

Of course, to be more useful we'd also want to take into account things like resource usage, training time, etc. and the models themselves must be constrained somehow, to avoid trivial solutions like "run the given network and see how it behaves, give that as our prediction".

We have to hardwire architectures because we can't learn architectures yet, or to put it another way, backpropagation doesn't yet work on hyperparameters as well as on parameters. Hyperparameters should be learnable as in theory there's nothing special about them (it's models all the way down!) - a hyperparameter is merely a parameter we don't yet know how to learn - and this has been demonstrated: http://jmlr.org/proceedings/papers/v37/maclaurin15.pdf "Gradient-based Hyperparameter Optimization through Reversible Learning"

"Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum."

But it's not feasible yet. Once it is, you can imagine collapsing the whole neural net zoo: you merely specify the input/output type/dimension and then it starts gradient-ascent over all the possible models as tweaked by internal hyperparameters.
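As a toy illustration of the idea (this is plain finite differences on a tiny linear model, not the paper's reversible-dynamics method; all names and numbers here are made up), one can treat the learning rate itself as something to optimise against the loss reached after training:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # synthetic regression targets

def loss_after_training(lr, steps=50):
    """Final mean-squared error after `steps` of gradient descent at rate `lr`."""
    w = np.zeros(3)
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y)) / len(y)
    return float(np.mean((X @ w - y) ** 2))

# Meta-optimisation: nudge the hyperparameter lr down its own loss gradient,
# estimated here by central finite differences.
lr, eps, meta_lr = 0.02, 1e-4, 1e-5
for _ in range(30):
    g = (loss_after_training(lr + eps) - loss_after_training(lr - eps)) / (2 * eps)
    lr -= meta_lr * g

print(lr, loss_after_training(lr))  # lr has grown and the trained loss has dropped
```

The paper replaces the finite differences with exact reverse-mode gradients through the whole training run, which is what makes thousands of hyperparameters tractable at once.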

I assume you mean what works for what problem. I think the problem is these are moving so fast that what didn't work yesterday has been coerced into being the best option for, say, image classification today.

Hell, ANNs themselves were widely the last choice for ML problems until a couple of years back, and now they seem like the first-choice go-to.

I think a lot of the time we do have a good idea why a certain neural network is better or worse at various tasks. The core problem is that the ability of the network to work is a function of the nature of the domain as much as, if not more than, the network itself, and it is very hard to get an understanding of the shape of the problem space aside from post-hoc insights gained from which types of networks work well and which don't in that particular domain.

On the contrary, the concept of a "neural network" is beautifully cohesive. Many of the architectural differences are only iterative; the fundamental units are very similar. Like how there's a lot of diversity in biology, but the fundamental units are shared.
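The shared fundamental unit is just a weighted sum pushed through a nonlinearity; a minimal sketch (the function names and the tanh choice are illustrative assumptions):

```python
import numpy as np

def neuron(x, w, b, act=np.tanh):
    """One fundamental unit: weighted sum of inputs, plus bias, through a nonlinearity."""
    return act(np.dot(w, x) + b)

def layer(x, W, b, act=np.tanh):
    """A layer is many such units sharing the same input; deeper nets stack layers."""
    return act(W @ x + b)

x = np.array([0.5, -1.0])
W = np.array([[0.1, 0.2], [0.3, -0.4]])
print(layer(x, W, np.zeros(2)))
```

Most entries in the zoo differ in how such layers are wired together (recurrence, convolutionally shared weights, skip connections), not in the unit itself.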

there is a decent unifying theoretical framework behind this: https://en.wikipedia.org/wiki/Infinite_monkey_theorem

The explanation about Markov chains is too wrong to be useful. What on earth do the "nodes" (states) in a Markov chain have to do with the "nodes" (neurons) of a neural network?
