How can we make robotics more like generative modeling? (evjang.com)
61 points by ericjang on July 30, 2022 | 3 comments



It seems like these things could perform much better if the modeling were deliberately decomposed more, with less emphasis on doing everything in one parallel step.

For example, translating the visual input into a 3d model first, or maybe into some neural representation that can generate the 3d models, then training the movement on that rather than on raw pixels.

Similarly, for textual prompts describing interactions, first create a model that relates the word embeddings to the same 3d models and physics interactions.
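
Something like this toy PyTorch sketch is roughly what I mean for the visual side (the text side would be another encoder into the same 3d representation). Every module, shape, and size here is made up purely to illustrate the decomposition: pixels -> explicit 3d representation -> policy that never sees raw pixels.

    import torch
    import torch.nn as nn

    class PixelsToVoxels(nn.Module):
        # Stage 1: predict a coarse 16^3 occupancy grid from an RGB image.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, 16 ** 3))

        def forward(self, img):                      # img: (B, 3, H, W)
            return self.encoder(img).view(-1, 16, 16, 16).sigmoid()

    class VoxelPolicy(nn.Module):
        # Stage 2: act on the 3d representation, never on raw pixels.
        def __init__(self, action_dim=7):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(), nn.Linear(16 ** 3, 256), nn.ReLU(),
                nn.Linear(256, action_dim))

        def forward(self, voxels):
            return self.net(voxels)

    scene_model = PixelsToVoxels()
    policy = VoxelPolicy()
    action = policy(scene_model(torch.randn(1, 3, 64, 64)))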

Obviously much easier said than done.


In industry you do see some decomposition into steps to solve problems, and there are lots of research papers that decompose too. It does work, but decomposition also has major weaknesses.

First, you have to come up with a good intermediate representation, and that's pretty difficult. Your suggestion of a 3d model is good, but it involves a lot of complex design choices. Should you use a mesh or a voxel representation? What resolution will work? How do you train the upstream model? As your problem gets more and more complex, so does your intermediate representation, and the engineering to get the intermediate representation working correctly becomes prohibitively difficult.

But we're not done yet! So now you've got an intermediate representation, a net that produces it, and a net that consumes it. Maybe your current results aren't good enough to publish, so you spend some time optimizing the upstream model. You take a few weeks and make massive improvements to its accuracy. Great, right? But then you plug the improved upstream model into the system and get no change in overall performance. It turns out your intermediate output got better, but in a way that doesn't matter to the task. Now you get to spend a month guessing how to tune your intermediate loss so that a better-trained upstream model actually improves end-to-end performance.
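
To make the failure mode concrete, here's a hypothetical toy version of that training setup; every name, shape, and loss here is invented for illustration. The point is that the upstream net only ever sees its own intermediate loss, so driving that loss down says nothing about the task loss.

    import torch
    import torch.nn.functional as F

    upstream = torch.nn.Linear(8, 4)      # stand-in for the perception net
    downstream = torch.nn.Linear(4, 2)    # stand-in for the policy / task net
    opt_up = torch.optim.Adam(upstream.parameters())
    opt_down = torch.optim.Adam(downstream.parameters())

    x = torch.randn(32, 8)                # observations
    target_repr = torch.randn(32, 4)      # labels for the intermediate representation
    target_task = torch.randn(32, 2)      # labels for the actual task

    # Phase 1: polish the upstream model against the intermediate labels only.
    repr_loss = F.mse_loss(upstream(x), target_repr)
    repr_loss.backward()
    opt_up.step()
    opt_up.zero_grad()

    # Phase 2: train the downstream net on the frozen upstream output.
    with torch.no_grad():
        frozen_repr = upstream(x)
    task_loss = F.mse_loss(downstream(frozen_repr), target_task)
    task_loss.backward()
    opt_down.step()

    # Nothing ties repr_loss to task_loss, so weeks spent driving repr_loss
    # down can leave task_loss exactly where it was.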

So yeah, decomposing the problem is often what you need to do in practice, but there's a reason many researchers are trying to work on more scalable end-to-end approaches. It's very, very difficult to answer both "What intermediate representation carries the information needed to perform this task?" and "Which aspect of the intermediate representation is most important to optimize?"

It's especially difficult in the area the blog post is about, "unstructured robotics." You can't focus your representation on certain types of objects. The representation needs to somehow capture disparate things like "The drill bit must go near the screw," "This is the squishy part," and "This is attached with a hinge." Now you need to program some type of model that can describe everything possible in the world, and no amount of OOP courses will help you.

There is a third way, which doesn't quite work yet: do unsupervised/self-supervised training of a world model, with a representation that's just a bag of floats. Maybe you can learn a world model that's informative enough that you can plug any sort of goal in and get good results. It's still unclear whether this will beat out an end-to-end system, so there's more research to be done.
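
Roughly, and with a lot of hand-waving, that third option might look like this: the representation is just a latent vector, the world model is trained self-supervised to predict the next frame's embedding, and a goal is "plugged in" by searching for actions that steer the latent toward the goal's embedding. Everything here (sizes, modules, the one-step random-shooting "planner") is a made-up placeholder.

    import torch
    import torch.nn as nn

    latent_dim, action_dim = 64, 7
    encode = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                           nn.Linear(256, latent_dim))
    dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                             nn.Linear(256, latent_dim))

    def world_model_loss(obs, action, next_obs):
        # Self-supervised: predict the next frame's embedding from the
        # current embedding and the action taken. No task labels anywhere.
        z, z_next = encode(obs.flatten(1)), encode(next_obs.flatten(1))
        z_pred = dynamics(torch.cat([z, action], dim=-1))
        return ((z_pred - z_next.detach()) ** 2).mean()

    def plan_one_step(obs, goal_obs, n_candidates=256):
        # Crude one-step "planner": sample random actions and keep whichever
        # one the model thinks moves the latent closest to the goal latent.
        z, z_goal = encode(obs.flatten(1)), encode(goal_obs.flatten(1))
        actions = torch.randn(n_candidates, action_dim)
        z_pred = dynamics(torch.cat([z.expand(n_candidates, -1), actions], dim=-1))
        return actions[(z_pred - z_goal).norm(dim=-1).argmin()]

    obs, goal = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    action = plan_one_step(obs, goal)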


I wasn't trying to suggest that there wouldn't be end-to-end training, just that it would be built on models that were first trained on subsets of the problem.

As far as "the drill bit must go near the screw" goes, my idea is that maybe there is a representation of 3d geometry, an attached representation of physics interactions, another for understanding types of shapes, and another level for drills versus screws.

You don't program it, you train it end to end to reproduce the scenarios, but based on those modules.
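
Very roughly, something like this; every module here is a placeholder, and the only point is that a single task loss backpropagates through all of them at once, rather than each module being fit separately.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    geometry = nn.Linear(512, 128)    # stands in for the 3d-geometry module
    physics = nn.Linear(128, 128)     # stands in for contact / physics reasoning
    semantics = nn.Linear(128, 64)    # stands in for "drill vs screw" concepts
    policy = nn.Linear(64, 7)         # final action head

    opt = torch.optim.Adam([*geometry.parameters(), *physics.parameters(),
                            *semantics.parameters(), *policy.parameters()])

    features = torch.randn(32, 512)       # stand-in scene features
    expert_action = torch.randn(32, 7)    # stand-in demonstration actions

    pred = policy(semantics(physics(geometry(features))))
    loss = F.mse_loss(pred, expert_action)
    loss.backward()    # one loss, gradients reach every module
    opt.step()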

I'm not saying it's easy or practical for most projects to do all of that from scratch. But to me it seems like a sensible program.

As far as voxels versus meshes go, I actually think you somehow want it to understand surfaces and high-level shapes. Again, a pretty big ask.

But the basic concept is that we know these are all important semantic categories, and if we can create modules that encapsulate that understanding, then we can get more accurate and more efficient reconstruction.



