Do tell me, since I'm writing a paper on a related topic, which current ML models can "pattern match" to recognize or generate multimodal (i.e. visual, auditory, and tactile) percepts of cats, in arbitrary poses, in any context where cats are usually/realistically found?
Or did you just mean that the "cat" subset of ImageNet is as "solved" as the rest of ImageNet?
We have this famous image showing progress over the last 5 years.
The latest generator in this list has very powerful latent spaces, including approximately accurate 3D rotations.
We have similarly impressive image segmentation and pose estimation results.
Since you mentioned it, note that models with multimodal perception are possible. The following uses audio together with video.
To be sure, these don't show the full breadth of versatility that humans have. I can still reliably distinguish StyleGAN faces from real faces, and segmentation still has issues. These models all have fairly prominent failure cases, they can't refine their estimates with further analysis the way humans can, and humans still learn much, much faster than they do.
However, note that (for example) StyleGAN has 26 million parameters, and with my standard approximate comparison of 1 bit : 1 synapse, that puts it probably somewhere around the size of a honey bee brain. Given that such a model already captures a sophisticated domain fairly reliably, using refined variants of old techniques with no need for a complete rethink, and that the same cannot be said for (e.g.) high-level reasoning, where older strategies (e.g. frames) are pretty much completely discredited, "not all that difficult" seems like a pretty defensible stance.
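For concreteness, here's the back-of-envelope arithmetic behind that comparison as a short sketch. The 32-bit weight size and the ~10^9 synapse count for a honey bee brain are my assumptions, not figures from the discussion above:

```python
# Rough sanity check of the "StyleGAN ~ honey bee brain" comparison.
# Assumptions (not from the thread): float32 weights, and an
# order-of-magnitude estimate of ~1e9 synapses for a honey bee brain
# (~1e6 neurons with ~1e3 synapses each).

STYLEGAN_PARAMS = 26_000_000   # parameter count cited above
BITS_PER_PARAM = 32            # float32 weights (assumption)
BEE_SYNAPSES = 1_000_000_000   # honey bee synapse estimate (assumption)

model_bits = STYLEGAN_PARAMS * BITS_PER_PARAM
ratio = model_bits / BEE_SYNAPSES  # bits per synapse at 1 bit : 1 synapse

print(f"model capacity: {model_bits:.1e} bits")
print(f"bits per bee synapse: {ratio:.2f}")
```

At roughly 0.8 bits per synapse, the model lands within a factor of two of the bee-brain estimate, which is all the "somewhere around the size of" claim needs.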