The famous AlexNet that blew away the ImageNet competition in 2012 contained 8 layers; more recent networks have even more.
But you're making it sound like shallow networks are a thing of the past. I would compare this to the field of NLP, where it seems we don't have a good general idea of what to do with deep networks, and the things that work so far are mostly shallow.
word2vec is one layer plus a softmax. GloVe is similar to a one-layer network. char-rnn seems to do as well with two layers as it does with three. The linear classifiers trained by gradient descent, such as logistic regression, are equivalent to one-layer networks.
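
To make the last point concrete, here is a rough sketch (mine, not taken from any particular library) of logistic regression trained by plain gradient descent. The function names, the toy data, and the learning rate are all made up for illustration. The thing to notice is that the model's forward pass is exactly one dense layer followed by a sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_layer_forward(X, w, b):
    # The entire "network": one dense layer followed by a sigmoid.
    return sigmoid(X @ w + b)

def train_logistic_regression(X, y, lr=0.5, steps=500):
    # Batch gradient descent on the mean cross-entropy loss.
    # lr and steps are arbitrary illustrative values.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = one_layer_forward(X, w, b)
        # Gradient of cross-entropy w.r.t. the pre-activation is (p - y).
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * np.mean(p - y)
    return w, b

# Toy data, invented for this sketch: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = train_logistic_regression(X, y)
accuracy = np.mean((one_layer_forward(X, w, b) > 0.5) == y)
print("training accuracy:", accuracy)
```

Swap the sigmoid for a softmax over several output units and you have multiclass logistic regression, which is still a single layer.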