> D-REX proposes a really clever trick to get around not having any reward labels at all, even when the demonstrator is suboptimal: Given a suboptimal policy... add variable amounts of noise to its actions. Assume that adding noise to a suboptimal policy makes it even more suboptimal... Train a ranking model to predict which of two trajectories has a higher return. The ranking model magically extrapolates to trajectories that are better
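In case it helps to see the moving parts, here's a rough toy sketch of that noise-ranking setup (my own illustration on a made-up 1D task, not the authors' code; the environment, network sizes, and hyperparameters are all invented, and the pairwise Bradley-Terry-style loss is my reading of what the ranking model is trained on):

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def rollout(noise_eps, horizon=30, goal=1.0, rng=np.random.default_rng()):
    """Suboptimal demonstrator on a toy 1D task: step 10% of the way to the
    goal, but with probability `noise_eps` take a random step instead."""
    x, traj = 0.0, []
    for _ in range(horizon):
        if rng.random() < noise_eps:
            x += rng.uniform(-0.2, 0.2)      # injected action noise
        else:
            x += 0.1 * (goal - x)            # the demonstrator's (mediocre) policy
        traj.append([x])
    return torch.tensor(traj, dtype=torch.float32)

# Generate trajectories at several noise levels and rank them by noise level
# alone -- no reward labels anywhere. Lower noise is *assumed* to mean higher return.
noise_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
trajs = [(eps, rollout(eps)) for eps in noise_levels for _ in range(20)]

reward_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

for step in range(2000):
    i, j = np.random.choice(len(trajs), size=2, replace=False)
    (eps_i, traj_i), (eps_j, traj_j) = trajs[i], trajs[j]
    if eps_i == eps_j:
        continue
    better, worse = (traj_i, traj_j) if eps_i < eps_j else (traj_j, traj_i)
    # Bradley-Terry style ranking loss: push the predicted return of the
    # less-noisy trajectory above that of the noisier one.
    loss = -F.logsigmoid(reward_net(better).sum() - reward_net(worse).sum())
    opt.zero_grad()
    loss.backward()
    opt.step()

# `reward_net` is the learned reward; handing it to an RL algorithm is where
# the hoped-for extrapolation past the demonstrator comes in.
```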
What strikes me about this is that the assumption (adding noise to a policy makes it worse) goes completely against evolutionary approaches to AI (that we can look for improvements by adding noise).
The two ideas are mostly compatible (and neither assumption always holds):
(Evolutionary) If you generate enough perturbations then some of them are better.
(TFA) If you generate perturbations then most of them are worse.
In the evolutionary case you also explicitly design your model and algorithm to try to generate good perturbations, so the two ideas aren't necessarily directly comparable anyway.
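To make the contrast concrete, here's a toy (1+λ)-style evolution-strategy loop (entirely my own illustration, not from the article or the paper): it also perturbs parameters with noise, but it draws many perturbations and keeps only the occasional better one, whereas D-REX leans on the perturbations being worse on average.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    """Stand-in for a policy's return; just a smooth toy objective here."""
    return -np.sum((theta - 1.0) ** 2)

theta = np.zeros(5)                      # current (suboptimal) parameters
for generation in range(100):
    candidates = theta + 0.1 * rng.standard_normal((32, 5))   # 32 noisy copies
    scores = np.array([fitness(c) for c in candidates])
    # Most perturbations score worse (D-REX's assumption), but with enough
    # samples some score better (the evolutionary assumption) -- keep the best.
    if scores.max() > fitness(theta):
        theta = candidates[scores.argmax()]

print(fitness(theta))                    # approaches 0, i.e. near the optimum
```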
It seems like this stuff would work better if they combined things like 3D modeling (maybe built on NURBS or CSG or something) with the deep learning, and also just tried to do less in one step, or even with one model. Or maybe it's a more flexible neural model at the core, but it's still compositional and more precisely trained on constituent parts. Train the system to nail down core concepts first and then build off of that.
Not saying it's easy to do that. But take as an example a robot operating indoors. You may want it to do many domestic tasks and so have a degree of generalization, but it doesn't need to be able to model landscape scenery or birds. And the environment has quite a lot of regular shapes in it, since it's man-made. With so many regular shapes, flat walls, rectangles, and straight edges, the vision system should first be able to nail the representation of those. Once you have a solid understanding of indoor 3D reconstruction, the structures used there can be leveraged for things like imagining described scenarios.
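As a toy illustration of what I mean by a compositional, CSG-style representation (purely hypothetical on my part, not anything from the article): indoor geometry composes out of a handful of regular primitives, which a learned front end could then select and parameterize. Here boxes are represented as signed distance functions and combined with CSG union/difference:

```python
import numpy as np

def sdf_box(p, center, half_extents):
    """Signed distance from point(s) p to an axis-aligned box."""
    q = np.abs(p - center) - half_extents
    return np.linalg.norm(np.maximum(q, 0.0), axis=-1) + np.minimum(q.max(axis=-1), 0.0)

def union(d1, d2):        # CSG union of two solids
    return np.minimum(d1, d2)

def difference(d1, d2):   # CSG difference: d1 with d2 carved out
    return np.maximum(d1, -d2)

def room_sdf(p):
    """A 'room': hollow box (outer shell minus interior), a table, and a doorway cut-out."""
    outer = sdf_box(p, np.zeros(3), np.array([3.0, 3.0, 1.5]))
    inner = sdf_box(p, np.zeros(3), np.array([2.9, 2.9, 1.4]))
    walls = difference(outer, inner)
    table = sdf_box(p, np.array([1.5, 0.0, -1.0]), np.array([0.5, 0.3, 0.4]))
    solid = union(walls, table)
    doorway = sdf_box(p, np.array([3.0, 0.0, -0.5]), np.array([0.2, 0.5, 1.0]))
    return difference(solid, doorway)

# Query a few points: negative = inside solid geometry, positive = free space.
pts = np.array([[0.0, 0.0, 0.0],     # middle of the room (free space)
                [0.0, 2.95, 0.0],    # inside a wall
                [1.5, 0.0, -1.0],    # inside the table
                [3.0, 0.0, -0.5]])   # in the doorway cut-out (free)
print(room_sdf(pts))
```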
I remember a Nature paper where they did essentially this (I can't remember enough to look it up quickly). There were two parts to the model, one of which was a simpler "transformation" part involving little optimization per se, the other of which was a more traditional DL model.