I only read the abstract, so I'm sure this is a basic/dumb question... but if you don't label images as faces or not, what makes it a face detector? :) How do you get an elbow detector or a butt detector out of the same algorithm?
Show it a zillion pictures, then show it a face and see what gets activated: that's your face detector. Show it an elbow or a butt and see what gets activated: that's your elbow or butt detector.
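In code, that probing step might look something like this. This is only a rough sketch in PyTorch: the pretrained ResNet, the choice of layer, and the random stand-in batches are all my own placeholders, not anything from the paper.

    import torch
    import torchvision.models as models

    # Stand-in batches; in practice these would be preprocessed
    # photos of faces and of anything-but-faces.
    faces = torch.randn(16, 3, 224, 224)
    non_faces = torch.randn(16, 3, 224, 224)

    model = models.resnet18(weights="IMAGENET1K_V1").eval()
    features = torch.nn.Sequential(*list(model.children())[:-1])  # drop the classifier

    with torch.no_grad():
        face_acts = features(faces).flatten(1)        # (N, 512) unit activations
        other_acts = features(non_faces).flatten(1)

    # The unit that fires much more for faces than for everything
    # else is your "face detector".
    score = face_acts.mean(0) - other_acts.mean(0)
    print("face detector unit:", score.argmax().item())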
It automatically creates a set of features that you can then feed into a final layer of machine learning to get what you want.
In machine learning you normally have to create that set of features yourself (called feature engineering: basically, algorithms to better represent your data). The amazing thing about deep learning is that the computer does this for you!
You just need a few tens or hundreds of face/non-face images (and the same for 20,000 other objects); this is called fine-tuning.
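As a rough illustration of that fine-tuning step: freeze the learned features and train only a small final layer on a handful of labeled examples. The model, the two-class head, and the random stand-in data below are assumptions for the sketch, not the paper's setup.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(weights="IMAGENET1K_V1")
    for p in model.parameters():
        p.requires_grad = False                    # keep the learned features fixed
    model.fc = nn.Linear(model.fc.in_features, 2)  # new face/non-face head

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Stand-in for a few dozen labeled images; real data goes here.
    images = torch.randn(32, 3, 224, 224)
    labels = torch.randint(0, 2, (32,))

    model.train()
    for _ in range(10):                            # a few passes over a tiny set
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()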
For more, Andrew Ng, Geoff Hinton, and Yann LeCun have given talks on this at Google, and they are up on YouTube.
This paper is actually more interesting than that: it automatically learns a "neuron" whose firing represents a detected face, without any supervised technique. It shows that it's possible to extract complex information solely from the data.
Using "reward" or say supervised training is easier and (near certainly) often gives better result, but unsupervised is more interesting as a research result, it tells that we can actually extract very high level information from data itself, using some "obvious" rules (such as linearly mix adjacent pixels and give as sparse-"laplace distribution like" results as possible). It is important because it proves that we may simulate brain functionality without knowing exact structure of brain (as we know brain is complex), but by analysis the data it processes using lots of simple structure instead.