Even if you get the perfect training data, state-of-the-art neural networks will not be 100% reliable. Convolutional neural networks work by recognizing shapes layer by layer, each layer building on the previous one's output. In theory, the first layer should match a patch of the image against a pattern like this and conclude "it's probably an eye":
o o X o o
o X X X o
o X X X o
o o X o o
(Above, "X" denotes low relative brightness and "o" denotes high relative brightness.)
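This first-layer matching can be sketched as a dot product between the patch and a kernel, squashed into a confidence score. Everything here is a toy illustration: the kernel weights are hand-picked, where a real network would learn them from data.

```python
import numpy as np

# Hypothetical first-layer "eye" kernel: positive weights where the
# pattern expects low brightness (X), negative where it expects high (o).
eye_kernel = np.array([
    [-1, -1,  1, -1, -1],
    [-1,  1,  1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1, -1,  1, -1, -1],
], dtype=float)

# A patch that matches the pattern (1.0 = dark, 0.0 = bright)...
eye_patch = (eye_kernel > 0).astype(float)
# ...and a patch of uniform mid-level brightness for comparison.
flat_patch = np.full((4, 5), 0.5)

def activation(patch, kernel):
    """Dot product of patch and kernel, squashed to (0, 1) by a sigmoid."""
    score = np.sum(patch * kernel)
    return 1.0 / (1.0 + np.exp(-score))

print(activation(eye_patch, eye_kernel))   # close to 1: "probably an eye"
print(activation(flat_patch, eye_kernel))  # well below 0.5: probably not
```

A real convolutional layer slides this kernel over every position in the image, but the per-patch computation is exactly this multiply-and-sum.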
The next convolutional layer is supposed to see something like this and conclude it's a face:
(eye) (not a bumper) (eye)
(not a door) (nose) (not a wheel)
(not a shovel) (mouth) (not a window)
That's in theory. In practice, each input to the second layer will be a vector of probabilities: something like 90% eye, 80% wheel, 1% door. The output will also be a vector of probabilities, and the layer is supposed to find a linear combination of inputs that produces the best statistical outcome. So it can very easily come up with kernels like this:
* A: (an eye or a wheel) and maybe a bumper to the right of it
* B: (an eye or a wheel) and maybe a bumper to the left of it
* C: (two wheels) side-by-side
In this case, a baby's face will be encoded as A together with B and definitely not C. Because each successive layer produces a linear combination of the previous layer's outputs, the "or a wheel" parts get nicely canceled out as long as the "two wheels" kernel input is a low value, concluding that it's a face with a confidence of 99%. Until the light falls from a different angle and the baby holds a round Lego brick near their face, suddenly yielding 80% confidence that it's a car.
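The cancellation argument above can be sketched numerically. The weights and feature names below are invented for illustration (a trained network would arrive at its own), but they show how the "or a wheel" terms stay harmless only while the wheel inputs are near zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# All weights below are made up for illustration; a real network learns them.
# Inputs: probabilities for [eye on left, eye on right,
#                            wheel on left, wheel on right, mouth]
def face_score(p):
    eye_l, eye_r, wheel_l, wheel_r, mouth = p
    A = eye_l + wheel_l    # kernel A: "an eye or a wheel" on the left
    B = eye_r + wheel_r    # kernel B: "an eye or a wheel" on the right
    C = wheel_l + wheel_r  # kernel C: "two wheels side-by-side"
    # Face = A and B and definitely not C, plus a mouth.
    return sigmoid(5 * A + 5 * B - 10 * C + 2 * mouth - 6)

baby = [0.9, 0.9, 0.01, 0.01, 0.9]   # clear eyes, no wheel-like shapes
print(face_score(baby))              # ~0.99: confidently a face

# Odd lighting plus a round Lego brick near the face reads as 80% wheel:
baby_with_brick = [0.9, 0.9, 0.8, 0.01, 0.9]
print(face_score(baby_with_brick))   # drops sharply: no longer sure
```

One spurious "wheel" input inflates both A and C at once, and the carefully balanced linear combination falls apart.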
Drawing parallels to organic life, neural networks are good at replicating inborn instincts, like animals instantly recognizing predators, food or poisonous plants. But they are still far away from having the situational awareness needed to make judgment calls in most real-world situations.