We learn object representations by interacting with them over years in a multimodal fashion. Take for example a simple drinking glass: we know its material properties (it is transparent, solid, can hold liquids), its typical position (on a tabletop, upright with the open side on top), its usage (grab it with a hand and bring it to the mouth)...
We also make heavy use of the time dimension: over a few seconds we see the same objects from different viewpoints and possibly in different states.
Only after learning what a glass is can we easily recover its properties on a still 2D image.
So at least for learning (it might be skippable at inference), it makes a lot of sense to me to have more than 2D still images.
All I'm saying is that even with stereo inputs, we're doing more than computing depth from the baseline between the left/right images. Close one eye and you can still estimate relative object positions, because you learned that roads are mostly planar and cars don't float but stand on the road. You know the expected size of a car compared to, say, a human, and if the car looks smaller than the human, it must be farther away.
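Both cues in that paragraph can be written down with a basic pinhole camera model. A minimal sketch, where the focal length, baseline, and object heights are all made-up illustrative numbers, not anything from this thread:

```python
# Two depth cues under a pinhole camera model.
# All numeric values below are illustrative assumptions.

def depth_from_stereo(focal_px, baseline_m, disparity_px):
    """Stereo cue: depth = f * B / disparity (classic two-eye geometry)."""
    return focal_px * baseline_m / disparity_px

def depth_from_known_size(real_height_m, pixel_height, focal_px):
    """Monocular size cue: pixel_height = f * real_height / depth,
    so knowing the object's real size lets one eye estimate depth."""
    return focal_px * real_height_m / pixel_height

FOCAL_PX = 1000.0  # assumed focal length in pixels

# Stereo: 6.5 cm baseline, 5 px disparity -> 13 m away.
print(depth_from_stereo(FOCAL_PX, 0.065, 5.0))

# Monocular: a ~1.5 m tall car spanning 50 px vs a ~1.7 m tall
# person spanning 100 px. The car looks smaller than the person,
# so it must be farther away (30 m vs 17 m here).
print(depth_from_known_size(1.5, 50.0, FOCAL_PX))   # 30.0
print(depth_from_known_size(1.7, 100.0, FOCAL_PX))  # 17.0
```

The point being that the second function only works because of learned priors (typical car and human sizes), which is exactly the kind of knowledge that goes beyond baseline-based depth.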
Lidar _also_ doesn't know what the glass feels like.
Yes, I agree with you: lidar and most current vision sensors also suffer from this.