Yes. These systems are working on a 3D problem in a 2D world. They have a hard time with situations involving occlusion. A newer generation of systems will probably deduce 3D models from 2D images, build up a space of such models, generate new 3D models from it, and then paint and animate them. That's how computer-generated animation is done today, with humans driving. Most of those steps have already been automated to some degree.
Early attempts to do animation by morphing sort of worked, but they were prone to some of the same problems that 2D generative AI systems have: the intermediate frames between the starting and ending positions did not obey physical constraints.
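A minimal sketch of that failure mode, using a hypothetical toy setup: a rigid rod of length 1, pinned at the origin, keyframed at 0 and 90 degrees. Naive morphing interpolates the endpoint's coordinates linearly, and the rod visibly shrinks in the intermediate frames; interpolating the angle instead respects the rigid-body constraint. (The `lerp` helper and the rod scenario are illustrative, not from any particular system.)

```python
import math

def lerp(p, q, t):
    """Linearly interpolate between two 2D points (naive morphing)."""
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

start = (1.0, 0.0)   # rod endpoint at 0 degrees
end = (0.0, 1.0)     # rod endpoint at 90 degrees

# Morphed halfway frame: the endpoint lands at (0.5, 0.5),
# so the rod's apparent length drops to about 0.707 -- a
# physical constraint (constant length) is violated.
mid_morph = lerp(start, end, 0.5)
morph_length = math.hypot(*mid_morph)

# Interpolating in the constrained space (the angle) keeps
# the rod at unit length in every intermediate frame.
angle = math.radians(45)
mid_physical = (math.cos(angle), math.sin(angle))
physical_length = math.hypot(*mid_physical)
```

The general point is that interpolating in raw image or coordinate space ignores the manifold of physically valid configurations, which is exactly the kind of structure a 3D-model-based pipeline would capture.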
This is a good problem to work on, because it leads to a more effective understanding of the real world.