This looks impressive. Much more than even the Boston Dynamics demonstrations.
Flipping a pancake is extremely difficult because each pancake is different. I know that these videos must be cherry-picked, but being able to train a robot to do this just by demonstration feels like a massive leap.
Flipping a pancake was done in 2010. What looks impressive for humans is easy for robots and vice versa:
https://youtu.be/W_gxLKSsSIE?si=HDyNXe1Ys_eFXiVU
Another case in point: robot juggling was done in the 1990s, and to date we do not have a robot that can open any door reliably like a human. Kind of like Moravec's paradox.
To be fair it is far more complex for a robot to grip a spatula and use that spatula on a griddle than to use dynamic motion to flip a pancake in a pan.
Solving any one problem with robotic manipulation isn’t all that hard. It takes a lot of trial and error, but in general if the task is constrained you can solve it reliably. The trick is to solve *new* tasks without resorting to all that fine tuning every time. Which is what Russ is claiming here. He’s training an LLM with a corpus of one-off policies for solving specific manipulation tasks, and claiming to get robust ad hoc policies from it for previously unsolved tasks.
If this actually works, it’s pretty important. But that’s the core claim: that he can solve ad hoc tasks without training or hand tuning.
> He’s training an LLM with a corpus of one-off policies for solving specific manipulation tasks, and claiming to get robust ad hoc policies from it for previously unsolved tasks.
It seems clear that many people do not understand that this is the key breakthrough: solving arbitrary tasks after learning previous, unrelated tasks.
In my opinion that really is a good definition of intelligence, and puts this technique at the forefront of machine intelligence.
Flipping a pancake in a "random kitchen" would be much more difficult and have many of the same issues as the door problem.
It's hard to point to a single thing that would make "flipping pancakes" intractable. It's sort of the other way around: usefully flipping pancakes the way a person does takes a lot of skills chained together.
The "door problem" is a sort of compendium of many real-world skills: identifying the door, understanding its affordances and how to grip / manipulate them, whether to push or pull the door, predicting the trajectory of the door when opened, estimating the mass of the door and applying the right amount of force, understanding if there are any springs or pulls on the door and how it must be held to traverse through it, etc. There are also a ton of things I'm missing that are so fundamental one tends to take them for granted, like knowing your own size and that you can't fit through a tiny doorway.
I think you can ramp towards the "door problem" in difficulty by slowly relaxing constraints. A video linked above (not article) shows "can flip a pancake successfully with a particular pan (you are already holding) and pancake with a fixed camera and visual markers". Ok, now do it in varying lighting conditions. With no visual markers. With different camera views. Different pancakes. Real pancakes (which are not rigid, and sometimes stick to the pan). Different pans. Now you have to pick up the pan. Use a stove. Different stoves. Identify griddle vs pan and use the right flipping technique. Find everything and do it all in a messy kitchen... eventually you're getting to the same ballpark as the "door problem".
physicist here (so very naive on these topics) - I’m wondering how to compare the steps you mention regarding the door problem (especially the predictive ones, e.g. about the trajectory of the door as it opens, etc) with how humans open doors? Surely people don’t stop in front of a door and begin planning things out, rather they seem to go for it and adjust on the fly, is this an approach that won’t work in robotics? Why not?
So in classical robotics, yeah, people used to write code for each step of opening a door. Practically speaking you would probably not do motion planning on the door, you would just code it up with a bunch of heuristics like, try to be over here in relation to the doorframe because that's a good opening spot and will probably work. Ok you're in the right place? Now, move gripper towards the door handle... etc. Bunch of hacks. Put enough hacks together and you can kinda sorta open (some) doors. Oh this is a SLIDING door? Damn we forgot to code for that...
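To make the "bunch of hacks" idea concrete, here is a minimal sketch of what such hand-coded door logic might look like. Everything in it is a made-up illustration (the `Door` fields, the 0.6 m standoff, the step descriptions), not code from any real robot stack. The point is only that a case nobody wrote a rule for, like a sliding door, simply fails.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Door:
    """Hypothetical door description; all fields are illustrative assumptions."""
    kind: str               # "hinged" or "sliding"
    opens: str              # "push" or "pull" (only used for hinged doors)
    handle_height_m: float  # height of the handle above the floor


def plan_door_opening(door: Door) -> List[str]:
    """Return a hand-written list of steps, or raise if no heuristic matches."""
    if door.kind != "hinged":
        # The "damn, we forgot to code for that" case.
        raise NotImplementedError(f"no heuristic coded for {door.kind!r} doors")

    steps = [
        # Assumed magic number: a standoff that "will probably work".
        "stand 0.6 m in front of the door frame, offset toward the handle",
        f"move gripper to the handle at {door.handle_height_m:.2f} m",
        "grasp the handle and rotate it down",
    ]
    if door.opens == "push":
        steps.append("drive forward while holding the handle")
    elif door.opens == "pull":
        steps.append("back up along an arc while holding the handle")
    else:
        raise ValueError(f"unknown opening direction: {door.opens!r}")
    return steps
```

Calling `plan_door_opening(Door("hinged", "push", 1.0))` returns four steps, while `Door("sliding", "push", 1.0)` raises `NotImplementedError`. Every new kind of door means another branch someone has to write by hand.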
The way things are going is sensors (cameras, force, etc) and neural networks. You let the robot try a bunch of ways of opening doors, sometimes it doors itself in the face, eventually it'll figure out good places to stand based on what the door looks like. The more doors you make it try to open hopefully the better it gets at generalising over the task of opening doors. The hacks/heuristics are really still there but the robot is supposed to learn them.
> Surely people don’t stop in front of a door and begin planning things out, rather they seem to go for it and adjust on the fly, is this an approach that won’t work in robotics? Why not?
Yeah, figuring out how to do this is basically "the problem". Most people don't have a sense or feeling of "planning things out" as they open a door because we reached "unconscious competence" at that task. We definitely have predictions of what is going to happen as we start opening the door based on prior experience and our observations so far. If reality diverges from our expectations we will experience surprise, make some new predictions, take actions to resolve the surprise, etc.
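A toy version of that predict, act, get surprised, update loop might look like the sketch below. The model is a deliberate oversimplification and entirely an assumption for illustration: the door moves by `force / resistance` per push, the "sensor" is just the true model, and the update rule is a simple halfway blend. No claim that people or real robots work this way.

```python
from typing import List, Tuple


def predict_motion(force: float, resistance_estimate: float) -> float:
    """Toy forward model (assumed): motion per push = force / resistance."""
    return force / resistance_estimate


def open_door(true_resistance: float,
              initial_estimate: float,
              target_motion: float = 0.2,
              surprise_threshold: float = 0.02,
              max_pushes: int = 10) -> Tuple[float, List[float]]:
    """Push repeatedly, revising the resistance estimate whenever surprised."""
    estimate = initial_estimate
    surprises = []
    for _ in range(max_pushes):
        # Choose the force we *expect* to produce the target motion.
        force = target_motion * estimate
        predicted = predict_motion(force, estimate)
        # Stand-in for a sensor reading: what the real door actually does.
        observed = force / true_resistance
        surprise = abs(observed - predicted)
        surprises.append(surprise)
        if surprise > surprise_threshold:
            # Move the estimate halfway toward what this push implies.
            implied = force / observed
            estimate += 0.5 * (implied - estimate)
    return estimate, surprises
```

With `open_door(true_resistance=8.0, initial_estimate=2.0)` the first push moves the door far less than expected, the surprise is large, and the estimate climbs toward the true value until the surprise drops below the threshold and the updates stop.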
Not sure that anyone has ever studied how people open doors in detail, it'd be interesting. I bet there are a ton of subtle micro behaviours. One that I know is, if you hear kids running in the house it is a good idea to keep a foot planted in front of you as you approach the door, because those guys will absolutely fling or crash doors open right into your face.
Thank you, great answer. As soon as I had asked my question I realized that we must have a lot of unconscious behaviours. Very interesting points about surprise/expectations. And top marks for the advice about kids
I was responding to address why the "door problem" is more difficult than "pancake flipping under controlled conditions".
(I also ignored that door opening is generally done by mobile robots of a certain weight class which tend to be more expensive than a stationary arm with enough strength to pick up a spatula or hold a pan).
There is a steep difficulty gradient from "works in the lab" to "works under semi-controlled real world conditions" to "works in uncontrolled real-world situations".
Not just touch but proprioception. Robots in human environments will have to be better at proprioception than 98% of humans. If I bump into you it’s typically anything from annoying to a meet-cute. I’m a pretty big guy, but if you had to choose between me and somebody else to step on your foot, it’s probably me you want, because I will shift my weight off your foot before you even know what happened (tai chi), so you will barely notice.
If instead your choice is your high school bully or a robot, well for now pick the bully. Because that robot isn’t even being vicious and will hurt more.
> Because that robot isn’t even being vicious and will hurt more.
Rodney Brooks at the MIT AI Lab was a big advocate of something called "series elastic actuators." The idea was that you didn't allow motors to directly turn robot joints. Instead, all motors acted through some kind of elastic. And the robots could also measure how much resistance they encountered and back off.
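Here is a rough sketch of that idea. The spring relation (torque = stiffness × deflection between motor and joint) is plain Hooke's law, but the controller, the numbers, and the blocked-joint simulation are all simplified assumptions for illustration, not the actual MIT controllers.

```python
from typing import Tuple


def spring_torque(theta_motor: float, theta_joint: float, stiffness: float) -> float:
    """Torque passed through the elastic element, read off its deflection."""
    return stiffness * (theta_motor - theta_joint)


def step_motor(theta_motor: float, theta_joint: float, target: float,
               stiffness: float, torque_limit: float, max_step: float) -> float:
    """One control tick: head for the target, but back off if torque is too high."""
    torque = spring_torque(theta_motor, theta_joint, stiffness)
    if abs(torque) > torque_limit:
        # Too much resistance: unwind the spring by moving toward the joint.
        return theta_motor - max_step if torque > 0 else theta_motor + max_step
    # Otherwise take a bounded step toward the target angle.
    delta = max(-max_step, min(max_step, target - theta_motor))
    return theta_motor + delta


def simulate(target: float = 1.0, obstacle: float = 0.5, stiffness: float = 100.0,
             torque_limit: float = 5.0, max_step: float = 0.01,
             ticks: int = 200) -> Tuple[float, float]:
    """Quasi-static toy: the joint follows the motor until it hits an obstacle."""
    theta_motor = 0.0
    peak_torque = 0.0
    for _ in range(ticks):
        theta_joint = min(theta_motor, obstacle)  # joint is blocked at the obstacle
        theta_motor = step_motor(theta_motor, theta_joint, target,
                                 stiffness, torque_limit, max_step)
        theta_joint = min(theta_motor, obstacle)
        peak_torque = max(peak_torque,
                          abs(spring_torque(theta_motor, theta_joint, stiffness)))
    return theta_motor, peak_torque
```

In this toy, `simulate()` has the motor stall a little past the obstacle with the spring torque hovering around the limit instead of climbing toward what a rigid drive would apply. With the obstacle moved out of the way (`simulate(obstacle=2.0)`), the motor just reaches the target with no spring torque at all.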
MIT had a number of demos of robots that played nicely around fragile humans. I remember video of a grad student stopping a robot arm with their head.
Now, using series elastic actuators will sacrifice some amount of speed or precision. You wouldn't want to do it for industrial robots. And of course, robots also tend to be heavy and made of metal, so even if they move gently, they still pose real risks.
But real progress has been made on these problems.
I think you're probably right, and those non-linear systems are going to make me have to increase my estimate for how long it takes for a robot to go from 5 year old child to ninja physicality. The more complex the feedback mechanisms, the more complexity there is in, for instance, screwing in a screw as fast as possible.
The robot won't take any enjoyment out of it, and won't laugh at your pain. Won't post about it on social media. Isn't going to try and fuck your ex or sister or mom.
I'm pretty sure that if I had never opened a door before and I saw somebody opening a door in a video, I would immediately know how to open doors just by watching the video. And that would be any door, with any kind of door handle. Not because I got superpowers, but because I'm average-human.
So, the moment your system needs this kind of data and that kind of data, oh and btw it needs a few hundreds of thousands of examples of all those kinds of data, it's very clear to me that it's far away from being capable of any kind of generalisation, any kind of learning general behaviour.
So that's "60 difficult, dexterous skills" today, "1,000 by the end of 2024", and when the aliens land to explore the ruins of our civilisation your robot will still be training on the 100,000th "skill". And it will still fall over and die when the aliens greet it.
I think their robot has a way of converting touch to a video input. The white bubble manipulator has a pattern printed on the inside that a camera watches for movement. (see 1:58 of the video).
And here I thought manual labor jobs were safe for a very long time. I really hope people at the policy level are thinking about what it looks like to have a world of people that don’t have any work to do.