I don't think it's possible to do without learning a hidden 3D model from 2D data.
It's nonsense to learn 2D projections directly from 2D projections, because small errors in the 2D projection can become catastrophic errors in the 3D model that humans use to interpret a 2D photo.
Learn, maybe. But it's easy to construct 3D information from 2D observations. See structure from motion, shape from shadow, voxel carving, and even orbit determination.
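To make that concrete, here is a minimal sketch of the simplest case: recovering a 3D point from two 2D observations with a rectified stereo pair. It is a toy with assumed values, not a full structure-from-motion pipeline. The function names `project` and `triangulate`, the focal length, the baseline, and the test point are all hypothetical, and it assumes ideal pinhole cameras with known geometry.

```python
def project(point, f, cam_x=0.0):
    """Pinhole projection for a camera at (cam_x, 0, 0) looking down +Z."""
    X, Y, Z = point
    return (f * (X - cam_x) / Z, f * Y / Z)


def triangulate(left, right, f, baseline):
    """Recover (X, Y, Z) from matching points in a rectified stereo pair."""
    # Disparity is the horizontal shift between the two views: d = f * b / Z.
    disparity = left[0] - right[0]
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    Z = f * baseline / disparity
    # Back-project the left image point to depth Z.
    return (left[0] * Z / f, left[1] * Z / f, Z)


# Assumed values for illustration only.
f, baseline = 800.0, 0.1        # focal length in pixels, baseline in metres
truth = (0.5, -0.2, 4.0)        # a 3D point in the left camera's frame

left = project(truth, f)                       # 2D observation in the left image
right = project(truth, f, cam_x=baseline)      # 2D observation in the right image
recovered = triangulate(left, right, f, baseline)
print(recovered)
```

The same formula also shows where the worry about small 2D errors comes from. With these assumed numbers, a one-pixel disparity error moves a point at 4 m to about 4.2 m, but moves a point at 40 m (disparity of 2 pixels) out to 80 m, so the reconstruction is easy but its sensitivity grows quickly with depth.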