
Learning Dense Visual Object Descriptors By and For Robotic Manipulation - sahin-boydas
https://arxiv.org/abs/1806.08756
======
quantumwoke
The state of the art of robotic manipulation is really cool: robotic hands can
figure out which object is which (as in this video), the best points to grasp,
and how much pressure to apply to hold and move objects. Some group just needs
to combine it all and demonstrate a real-world application!

Obligatory video dump because this stuff is just so fascinating:

[https://www.youtube.com/watch?v=mIEbU7GfRhQ](https://www.youtube.com/watch?v=mIEbU7GfRhQ)

[https://www.youtube.com/watch?v=DPl_d7lbL84](https://www.youtube.com/watch?v=DPl_d7lbL84)

[https://www.youtube.com/watch?v=tLNyXAE7mLM](https://www.youtube.com/watch?v=tLNyXAE7mLM)

[https://www.youtube.com/watch?v=ZhsEKTo7V04](https://www.youtube.com/watch?v=ZhsEKTo7V04)

~~~
YeGoblynQueenne
Contrary to your comment, I find that if this paper really represents the
state of the art, then the state of the art is extremely limited.

The authors of the paper report as their major contribution a technique that
must be applied to each specific object separately, and their experiments
cover 47 objects in 3 "distinct classes" (shoes, hats, and mugs, plus a mix of
objects from diverse other classes):

 _We’ve also shown that self-supervised dense visual descriptor learning can
be applied to a wide variety of potentially non-rigid objects and classes (47
objects so far, including 3 distinct classes), can be learned quickly
(approximately 20 minutes), and enables new manipulation tasks._
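For context, the self-supervised descriptor learning the quote refers to trains a network so that pixels showing the same physical point map to nearby descriptors across views, while unrelated pixels are pushed apart. Here is a minimal NumPy sketch of that kind of pixelwise contrastive loss (simplified; the function and variable names are mine, not from the paper, and a real implementation would backpropagate through a CNN):

```python
import numpy as np

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """Sketch of a pixelwise contrastive loss for dense descriptors.

    desc_a, desc_b: (num_pixels, D) descriptor maps for two views, flattened.
    matches: (N, 2) index pairs (pixel in a, pixel in b) that depict the
        same physical point on the object.
    non_matches: (M, 2) index pairs known NOT to correspond.
    """
    # Matched pixels: pull their descriptors together (squared L2 distance).
    d_match = desc_a[matches[:, 0]] - desc_b[matches[:, 1]]
    match_loss = np.mean(np.sum(d_match ** 2, axis=1))

    # Non-matching pixels: push apart until they exceed the margin (hinge).
    d_non = np.linalg.norm(
        desc_a[non_matches[:, 0]] - desc_b[non_matches[:, 1]], axis=1)
    non_match_loss = np.mean(np.maximum(0.0, margin - d_non) ** 2)

    return match_loss + non_match_loss
```

The correspondences come "for free" from known camera poses and depth during a scan of the object, which is what makes the training self-supervised.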

I'm guessing that by "real world application" you mean an application where a
robot hand is required to manipulate arbitrary objects of unconstrained shape.
Even given the relatively short training time reported in the paper (though
starting from a pre-trained CNN model), this is not something that can
reasonably be expected of this kind of system. There are potentially billions
of different object shapes in the real world!

For real world applications a giant leap forward is required: a kind of AI
system with the ability to generalise well from few examples. This seems to
still be far out of the reach of the current generation of deep neural net-
driven vision systems.

~~~
GlenTheMachine
"Contrary to your comment, I find that if this paper really represents the
state of the art, then the state of the art is extremely limited."

The state of the art in robotic manipulation _is_ extremely limited (the state
of the art in robotic locomotion is also extremely limited, but somewhat
easier to paper over - simplified special cases get you farther in locomotion
than in manipulation). IMHO, the layer of robotic control systems above servo
control, but below spatial reasoning is an enormous hole in the field's
research portfolio. We've got pretty decent machine vision at this point --
still not up to human standards, but improving rapidly. We've got excellent
servo control laws. We've got good planning and scheduling techniques. What we
don't have is techniques for robustly and rapidly performing dexterous motion
planning, particularly where the contact forces with the environment are an
important part of the problem.

I was at Robotics-Science and Systems this year and heard a talk by Sergey
Levine, one of the handful of researchers addressing this topic. He put up a
picture of IBM Deep Blue playing Garry Kasparov. He pointed out that there was
no robot in the picture -- there was a human moving Deep Blue's pieces around
on the board. The point was that we are now in a strange place in robotics:
the hardest part of playing a grandmaster--level chess match isn't figuring
out the moves. It's reaching out and picking up the pieces.

~~~
YeGoblynQueenne
Thank you for the useful insight. I'm not an expert in robotics, far from it.

Like you say, machine vision has progressed, but it's still very far from
human vision. For one thing, most of the progress is in object classification,
so any problem that can't be reformulated as a classification problem is
basically out of the question. Then there's the issue of poor generalisation
between classes: for any class of object that we want a system to recognise,
we must train the system anew, at great cost, except with pre-training of
course (which basically means someone else already did the hard work).

~~~
GlenTheMachine
...and even if we completely solved the machine vision problem, the motion
planning under contact problem would remain.

At some point the world will wake up and realize that this problem is
approximately as hard as the machine vision problem, and will devote resources
to solving it. But that awareness hasn't happened yet.

------
modeless
The associated video:
[https://www.youtube.com/watch?v=L5UW1VapKNE](https://www.youtube.com/watch?v=L5UW1VapKNE)

------
pjc50
We haven't come as far as you'd expect since SHRDLU (1968):
[https://en.wikipedia.org/wiki/SHRDLU](https://en.wikipedia.org/wiki/SHRDLU)

------
bmh
This is incredible! The temporal consistency is amazing. Watch the video.

