I read over it, but there are a couple of problems with that. Firstly, I never said that computers understand the images, just that the recognition is better than human. When they get it wrong, they do tend to get it wrong in ways that look ludicrous to people. Secondly, Stanford didn't win the 2014 ImageNet challenge: GoogLeNet did.
Page 31 states: "Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%."
There are cases in there where GoogLeNet did better than the human annotator, but they were mostly fine-grained classifications (which breed of dog is that, for example). Overall, a human is still slightly better.
It's worth noting, however, that the error rate was roughly halved compared with 2013's winner. I reckon the 2015 error rate will beat the human one, and in the following years there will simply be no contest, just as chess programs have gone far beyond human capabilities.
And, of course, with cars we're talking about having multiple images plus depth perception, which makes it a lot easier for the machine.