
> In practice, neural networks use only two or three layers...

The famous AlexNet [1] that blew away the ImageNet competition in 2012 contained 8 layers; more recent networks have even more.

[1] http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

Yeah, it's more like the opposite. In theory you only need one hidden layer to get the same capacity as a network with more hidden layers. In practice, since autoencoders came along, networks with many hidden layers of gradually decreasing size have turned out to be easier to move from poor local energy minima toward minima that are close to globally low.

Yes, people took the universal approximation theorem [0] as evidence that they only need one hidden layer, but it gives zero guarantee of efficiency. A single hidden layer may need orders of magnitude more neurons than a network with n > 1 hidden layers, which could make the training time unrealistic. A well-chosen multi-layer structure can reduce that training time.
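To make the "no guarantee of efficiency" point concrete, here is a rough sketch using Boolean circuits rather than real neural nets (that swap is my assumption, as are the helper names `dnf_terms` and `xor_tree`). It compares n-bit parity written in its canonical two-level OR-of-ANDs form against a layered tree of 2-input XOR gates. It is only an analogy for shallow vs. deep, not a statement about any particular network.

```python
from itertools import product

def parity(bits):
    """Return 1 if an odd number of bits are set, else 0."""
    return sum(bits) % 2

def dnf_terms(n):
    """Count the AND terms in the canonical two-level (OR of ANDs) form
    of n-bit parity: one term per input pattern whose parity is 1."""
    return sum(1 for bits in product((0, 1), repeat=n) if parity(bits) == 1)

def xor_tree(bits):
    """Evaluate parity with a layered tree of 2-input XOR gates.
    Returns (result, number_of_gates_used)."""
    layer = list(bits)
    gates = 0
    while len(layer) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one XOR gate.
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])
            gates += 1
        # An unpaired last element passes through to the next layer.
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer = nxt
    return layer[0], gates

if __name__ == "__main__":
    for n in (4, 8, 12):
        _, gates = xor_tree([1] * n)
        print(n, dnf_terms(n), gates)
```

The shallow form grows as 2^(n-1) terms while the tree uses n-1 gates, which is the flavor of blow-up the comment is talking about.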


Nobody ever took that theorem seriously. Deep nets have been around since the 90s.

I've seen people with passing knowledge of NNs throw it around sometimes, and it's also referred to in a lot of the literature as one reason for needing to parallelize neural networks (which I've been reading a lot about, due to a project), even if it's not hugely important.

It's not hugely important because it tells us little of practical use. I mean, k nearest neighbors, given infinite data, can model any function as well. In practice, single-layer neural nets are not very useful and don't do a good job of learning feature representations.
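A quick sketch of the k-nearest-neighbors point, assuming a 1-D regression on sin(x) over an evenly spaced grid (the target function, the grid, and the names `knn_predict` and `max_error` are all just illustrative choices of mine): the fit gets better purely by adding data, with nothing learned.

```python
import math

def knn_predict(xs, ys, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

def max_error(n_samples, k=3):
    """Worst-case absolute error of k-NN on sin(x) over [0, 2*pi],
    using n_samples evenly spaced training points."""
    xs = [2 * math.pi * i / (n_samples - 1) for i in range(n_samples)]
    ys = [math.sin(x) for x in xs]
    # Evaluate at 50 fixed query points inside the interval.
    queries = [2 * math.pi * (j + 0.5) / 50 for j in range(50)]
    return max(abs(knn_predict(xs, ys, q, k) - math.sin(q)) for q in queries)

if __name__ == "__main__":
    for n in (20, 100, 1000):
        print(n, max_error(n))
```

The error shrinks as the sample count grows, but only because the table of memorized points gets denser, which is why this kind of "can model any function" result says little about practical usefulness.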

I wrote a comment on the article's page, saying this. They have now added a correction.

I understand it as _classic_ neural networks as opposed to deep networks. Not a good choice of words, though.

That's mostly tautological: the networks you call "deep" are the ones that use more layers. Unless you need funding or media attention, in which case you call everything "deep".

But you're making it sound like shallow networks are a thing of the past. I would compare this to the field of NLP, where it seems we don't have a good general idea what to do with deep networks, and the things that work so far are mostly shallow.

word2vec is one layer plus a softmax. GloVe is similar to a one-layer network. char-rnn seems to do as well with two layers as it does with three. All the gradient-descent classifiers out there are equivalent to one-layer networks.
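For anyone wondering what "one layer plus a softmax" looks like, here is a minimal sketch of a skip-gram style forward pass with a full softmax. The vocabulary size, embedding width, random weights, and the names `skipgram_forward`, `W_in`, and `W_out` are made up for illustration; this is not the actual word2vec code, which relies on tricks like negative sampling or hierarchical softmax instead of the full softmax.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_forward(word_id, W_in, W_out):
    """Return a probability distribution over context words.

    W_in:  vocab x dim input embeddings (the single linear layer).
    W_out: vocab x dim output vectors, one per candidate context word.
    """
    # Looking up a row is the same as multiplying a one-hot vector by W_in.
    hidden = W_in[word_id]
    # Score every vocabulary word by its dot product with the hidden vector.
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in W_out]
    return softmax(scores)

if __name__ == "__main__":
    random.seed(0)
    vocab, dim = 5, 3  # toy sizes, purely illustrative
    W_in = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    W_out = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    print(skipgram_forward(2, W_in, W_out))
```

There is no nonlinearity between the lookup and the softmax, which is what makes the model so shallow.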
