Imagine a simple scene: a single red point on a white background. You are currently rendering a point on the left of your virtual screen and trying to update your camera parameters so that it matches the target point on the right of the screen. Your gradients are zero: the rendered and target points don't overlap, so an infinitesimal change to the camera doesn't change the loss at all. You are already screwed.
With more complex scenes the situation gets a little better, because you can get lucky: if you happen to have overlapping areas of similar colors (between the current and target renderings), the gradient will tend to increase that overlap and hopefully drag you towards the correct camera parameters.
Then you try to mitigate this fundamental problem with little tricks that increase the likelihood of being in the lucky configuration. You add blur, turning a single point into a little disc, so there is more hope of overlapping areas. You add spatial pyramids (i.e. rendering at a range of resolutions) to allow large movements across the screen instead of moving one pixel per iteration. And because it still often doesn't work, you add random restarts.
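The zero-gradient failure, and why blur helps, can be seen in a toy 1-D setting (the screen size, positions, and splat widths here are made up for illustration; gradients are taken numerically for simplicity):

```python
import numpy as np

def render(pos, sigma, n=64):
    # Splat a single point at `pos` onto a 1-D "screen" as a Gaussian.
    xs = np.arange(n)
    return np.exp(-0.5 * ((xs - pos) / sigma) ** 2)

def pos_grad(pos, target_pos, sigma, eps=1e-4):
    # Numerical gradient of the image-space MSE w.r.t. the point position.
    loss = lambda p: ((render(p, sigma) - render(target_pos, sigma)) ** 2).sum()
    return (loss(pos + eps) - loss(pos - eps)) / (2 * eps)

sharp = pos_grad(16.0, 48.0, sigma=0.5)    # splats don't overlap at all
blurry = pos_grad(16.0, 48.0, sigma=12.0)  # wide splats overlap

print(abs(sharp), abs(blurry))  # first is ~0, second is usable
```

The blur doesn't make the problem well-behaved in general; it just widens the basin in which the gradient points the right way.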
Among the other issues:
- Making the object model non-parametric makes the search harder.
- Making the rendering process too "good" makes the search harder (you are considering advanced lighting effects, i.e. details, before the global picture).
- Even using triangles is already a bad choice: it was an optimization of the rendering pipeline to make the forward process faster, but here we are tackling the inverse problem, so this kind of premature optimization now plays against us.
So what are better approaches? There are several.
The old-school way is keypoint registration. You render your scene, extract keypoints (possibly dense) with their local statistics, and match them against the keypoints of the target. This lets you jump almost directly towards the solution, which you then finish aligning with the Iterative Closest Point (ICP) algorithm.
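A minimal sketch of that final ICP stage, assuming 2-D keypoints that the initial jump has already brought roughly into alignment (nearest-neighbor correspondences plus the Kabsch SVD solution for the rigid transform; the grid data is made up):

```python
import numpy as np

def icp_step(src, dst):
    # Match each source point to its nearest target point, then solve
    # for the best rigid transform with the Kabsch (SVD) method.
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    matched = dst[d2.argmin(axis=1)]
    mu_s, mu_m = src.mean(axis=0), matched.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_m))
    if np.linalg.det(U @ Vt) < 0:   # guard against a reflection solution
        Vt[-1] *= -1
    R = (U @ Vt).T
    t = mu_m - R @ mu_s
    return src @ R.T + t

# Target: a 6x6 grid of keypoints; source: the same grid, slightly
# rotated and translated (as if the camera guess were a bit off).
gx, gy = np.meshgrid(np.arange(6.0), np.arange(6.0))
target = np.stack([gx.ravel(), gy.ravel()], axis=1)
theta = 0.05
R0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
source = target @ R0.T + np.array([0.1, -0.05])

for _ in range(5):
    source = icp_step(source, target)
print(np.abs(source - target).max())   # essentially zero after alignment
```

The key property, matching the point above: once correspondences are right, a single closed-form step jumps to the solution, with no image-space gradients involved.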
The new-school way: you use an invertible GAN model to render an image. Inverting the generator gives you a latent vector for your image. You do the operation you want on this latent vector, then run the generator forward to get a new image. This latent vector contains the information about the scene, so you can train a neural network (possibly with attention) to answer questions about your scene parameters from it.
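A toy sketch of the inversion step, with a random linear map standing in for the generator (a real GAN inversion runs the same optimize-the-latent loop through a deep, nonlinear network; all sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the generator: a fixed random linear "decoder" from
# an 8-d latent vector to a 64-pixel image.
G = rng.normal(size=(64, 8))
decode = lambda z: G @ z

z_true = rng.normal(size=8)   # latent behind the "observed" image
target = decode(z_true)

# Invert the generator: gradient descent on || decode(z) - target ||^2.
z = np.zeros(8)
for _ in range(500):
    z -= 0.005 * 2 * G.T @ (decode(z) - target)

print(np.abs(z - z_true).max())   # recovered latent matches
```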
More interesting, though, is when the renderer is part of a larger pipeline and the loss is defined some other way (such as how well the agent is driving, in our DuckieTown simulator). Backprop through the renderer lets us do BPTT, which is very powerful, and we never have to compare images, avoiding that non-differentiability issue. Another example would be the autoencoder-like setups you can use to bootstrap CV models.
It remains to be seen how effective these things are in practice, of course, but there are plenty of interesting directions.
There is some special structure to the rendering problem that can probably be exploited better. One inherent difficulty for efficient solving is working with multiple hypotheses at the same time, which a differentiable renderer makes you give up. There is also special structure in the probability space: once you know the position of the camera it is easy to render, and once you know the position of the objects it is easy to locate the camera. This is why algorithms used for SLAM are much better able to tackle the problem.
Techniques like Rao-Blackwellized Particle Filters, for example, are more appropriate: they don't need a differentiable renderer, and you can combine them with neural networks, e.g. for pose estimation: https://arxiv.org/abs/1905.09304
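To make the contrast concrete, here is a toy plain particle filter (not the full Rao-Blackwellized version, and with an invented one-pixel renderer) localizing a 1-D pose against a hard, non-differentiable renderer. No gradients anywhere, and the particle cloud carries many hypotheses at once:

```python
import numpy as np

rng = np.random.default_rng(2)

def render(pose, n=32):
    # Hard, non-differentiable renderer: lights up exactly one pixel.
    img = np.zeros(n)
    img[int(np.round(pose)) % n] = 1.0
    return img

true_pose = 11.3
observation = render(true_pose)

# Plain particle filter over the pose.
particles = rng.uniform(0.0, 32.0, size=500)
for _ in range(10):
    particles += rng.normal(0.0, 0.5, size=particles.shape)   # diffuse
    errs = np.array([((render(p) - observation) ** 2).sum() for p in particles])
    weights = np.exp(-5.0 * errs)          # likelihood of each hypothesis
    weights /= weights.sum()
    keep = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[keep]            # resample

print(particles.mean())   # near the true pose, up to pixel quantization
```

The renderer here is exactly the kind of function gradient descent chokes on, yet the filter localizes fine.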
What I'm trying to say is that there is not a lot of useful extra information in a differentiable renderer, and the information doesn't flow well. These gradients live at the detail scale, when what matters is the large scale.
Sure, they can be used to stitch models together, but that's an unreliable stitch, and you won't know whether your agent fails to drive because it can't take the right decisions or because it can't process the scene correctly.
The gradients will also flow a lot better if you use a model-based RL approach, where you ask the agent's model to render what it thinks it sees and train it to match the output of the non-differentiable renderer. Those gradients will be a lot smoother and operate at a larger scale.
For optimization purposes, it's better to train an as-simple-as-possible differentiable rendering model to match the output of your complex renderer (so in practice you never need the complex renderer's gradients), and to use the gradients of this simple model to search for poses and camera parameters.
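A 1-D sketch of that proxy idea, with made-up components: a hard-edged disc stands in for the complex non-differentiable renderer, a Gaussian splat is the simple differentiable proxy, and only the proxy's analytic gradients drive the pose search. (Note the earlier caveat still applies: the initialization has to land within the overlap basin.)

```python
import numpy as np

xs = np.arange(64.0)

def complex_render(pos):
    # Stand-in for the expensive non-differentiable renderer: a hard
    # disc, whose gradient w.r.t. pos is zero almost everywhere.
    return (np.abs(xs - pos) < 3).astype(float)

def proxy_render(pos, sigma):
    # The simple differentiable proxy: a Gaussian splat.
    return np.exp(-0.5 * ((xs - pos) / sigma) ** 2)

# "Train" the proxy: fit its one parameter (sigma) so it reproduces the
# complex renderer's output on a few sample poses (coarse grid search).
sigmas = np.linspace(0.5, 10.0, 96)
fit_err = [sum(((proxy_render(p, s) - complex_render(p)) ** 2).sum()
               for p in (16.0, 32.0, 48.0)) for s in sigmas]
sigma = sigmas[int(np.argmin(fit_err))]

# Search for the pose behind a target image using only proxy gradients;
# the complex renderer is called once, just to produce the target.
target = complex_render(41.0)
pos = 35.0   # must start within the overlap basin
for _ in range(300):
    r = proxy_render(pos, sigma)
    dr_dpos = r * (xs - pos) / sigma ** 2      # analytic derivative
    pos -= 0.5 * 2.0 * ((r - target) * dr_dpos).sum()
print(pos)   # close to 41
```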
It's even better if you can train an invertible rendering model, so you can infer the poses and parameters in a single step (the same way style transfer is done).
The second question: if this is fully differentiable, how do you define the existence of a polytope in a scene in a differentiable way? Is the input space for a given parameter optimization limited to material or Phong/specular parameters?
On the second question, the discussion mentions that there are lots of non-differentiable parameters and that they don't handle them.