Even if you get the perfect training data, state-of-the-art neural networks will not be 100% reliable. Convolutional neural networks work by recognizing shapes layer by layer, each layer building on the previous one's output. In theory, the first layer should match a patch of the image against a pattern like this and conclude "it's probably an eye":
o o X o o
o X X X o
o X X X o
o o X o o
(Above, "X" denotes low relative brightness and "o" denotes high relative brightness.)
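This first-layer matching can be sketched as a dot product between the patch and a kernel, squashed into a confidence score. Everything here is a toy illustration: the kernel weights are hand-picked, where a real network would learn them from data.

```python
import numpy as np

# Hypothetical first-layer "eye" kernel: positive weights where the
# pattern expects low brightness (X), negative where it expects high (o).
eye_kernel = np.array([
    [-1, -1,  1, -1, -1],
    [-1,  1,  1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1, -1,  1, -1, -1],
], dtype=float)

# A patch that matches the pattern (1.0 = dark, 0.0 = bright)...
eye_patch = (eye_kernel > 0).astype(float)
# ...and a patch of uniform mid-level brightness for comparison.
flat_patch = np.full((4, 5), 0.5)

def activation(patch, kernel):
    """Dot product of patch and kernel, squashed to (0, 1) by a sigmoid."""
    score = np.sum(patch * kernel)
    return 1.0 / (1.0 + np.exp(-score))

print(activation(eye_patch, eye_kernel))   # close to 1: "probably an eye"
print(activation(flat_patch, eye_kernel))  # well below 0.5: probably not
```

A real convolutional layer slides this kernel over every position in the image, but the per-patch computation is exactly this multiply-and-sum.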
The next convolutional layer is supposed to see something like this and conclude it's a face:
(eye) (not a bumper) (eye)
(not a door) (nose) (not a wheel)
(not a shovel) (mouth) (not a window)
That's in theory. In practice, each input to the second layer will be a vector of probabilities: something like 90% eye, 80% wheel, 1% door. The output will also be a vector of probabilities, and the layer is supposed to find a linear combination of inputs that produces the best statistical outcome. So it can very easily come up with kernels like this:
* A: (an eye or a wheel) and maybe a bumper to the right of it
* B: (an eye or a wheel) and maybe a bumper to the left of it
* C: (two wheels) side-by-side
In this case, a baby's face will be encoded as A together with B and definitely not C. Because each successive layer produces a linear combination of the previous layer's outputs, the "or a wheel" parts get nicely canceled out as long as the "two wheels" kernel input is a low value, concluding that it's a face with a confidence of 99%. Until the light falls from a different angle and the baby holds a round Lego brick near their face, suddenly yielding 80% confidence that it's a car.
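The cancellation argument above can be sketched numerically. The weights and feature names below are invented for illustration (a trained network would arrive at its own), but they show how the "or a wheel" terms stay harmless only while the wheel inputs are near zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# All weights below are made up for illustration; a real network learns them.
# Inputs: probabilities for [eye on left, eye on right,
#                            wheel on left, wheel on right, mouth]
def face_score(p):
    eye_l, eye_r, wheel_l, wheel_r, mouth = p
    A = eye_l + wheel_l    # kernel A: "an eye or a wheel" on the left
    B = eye_r + wheel_r    # kernel B: "an eye or a wheel" on the right
    C = wheel_l + wheel_r  # kernel C: "two wheels side-by-side"
    # Face = A and B and definitely not C, plus a mouth.
    return sigmoid(5 * A + 5 * B - 10 * C + 2 * mouth - 6)

baby = [0.9, 0.9, 0.01, 0.01, 0.9]   # clear eyes, no wheel-like shapes
print(face_score(baby))              # ~0.99: confidently a face

# Odd lighting plus a round Lego brick near the face reads as 80% wheel:
baby_with_brick = [0.9, 0.9, 0.8, 0.01, 0.9]
print(face_score(baby_with_brick))   # drops sharply: no longer sure
```

One spurious "wheel" input inflates both A and C at once, and the carefully balanced linear combination falls apart.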
Drawing parallels to organic life, neural networks are good at replicating inborn instincts, like animals instantly recognizing predators, food or poisonous plants. But they are still far away from having the situational awareness needed to make judgment calls in most real-world situations.