
Self-Supervised Tracking via Video Colorization - ot
https://ai.googleblog.com/2018/06/self-supervised-tracking-via-video.html?r=1
======
cs702
Very clever, and "obvious" only in hindsight: Training a deep convnet to
colorize all frames in a grayscale video clip from a single color frame taken
from the same clip induces the neural net to learn to track _all_ objects in
the video, with robustness to occlusions, change of viewing angles, etc.
Labels are not required; only a color frame from each clip. Most impressively,
the embeddings learned by the convnet (i.e., the representations learned by
the next-to-last layer) are _linearly separable by object_. Very nice!
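For anyone curious about the mechanics: as I understand the paper, the net doesn't predict colors directly. It learns per-pixel embeddings, and each grayscale target pixel "points" at reference-frame pixels via a softmax over embedding similarities, then copies their colors. A minimal NumPy sketch of that pointer mechanism (names, shapes, and the temperature parameter are my own illustration, not the paper's code):

```python
import numpy as np

def copy_colors(ref_emb, tgt_emb, ref_colors, temperature=1.0):
    """Pointer-style color propagation: each target pixel copies a
    softmax-weighted mix of reference-frame colors, with weights given
    by embedding similarity. Shapes: ref_emb (N, D), tgt_emb (M, D),
    ref_colors (N, 3); returns predicted target colors (M, 3)."""
    logits = tgt_emb @ ref_emb.T / temperature       # (M, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over reference pixels
    return weights @ ref_colors                      # (M, 3) predicted colors
```

Tracking falls out for free: at test time you replace the reference colors with segment labels (or keypoints) and the same softmax weights propagate them forward.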

~~~
Eridrus
Simple to explain, yes, but I feel like this isn't even really "obvious" even
in hindsight. This whole thing is very clever.

The only complaint I have is that it's not better than supervised object
tracking, so I wonder if this idea is too late?

To draw a parallel to image classification, at one point in time neural nets
were trained with a bunch of unsupervised pre-training using reconstruction
loss, but that technique has basically fallen by the wayside as we've gotten
larger datasets and found a pile of tricks for training them from scratch.

~~~
cs702
Labeling object locations in all frames of a large number of video clips is
significantly more expensive than labeling a comparably large number of
images.

~~~
Eridrus
Sure, but like Imagenet, these datasets already exist. So unless these models
are quite brittle to the objects being tracked, this is likely not going to be
an issue.

~~~
vanderZwan
> Sure, but like Imagenet, these datasets already exist.

I'm pretty sure that making it cheap to use new datasets is _very_ valuable in
the long run.

------
arnioxux
I saw something similar before (from gifs.com's sticker editor of all things
lol) where you annotate the segmentation of the first frame of the video and
it will propagate that segmentation to the rest of the frames:

[https://medium.com/gifs-ai/interactive-segmentation-with-con...](https://medium.com/gifs-ai/interactive-segmentation-with-convolutional-neural-networks-2e171a85df82#fc74)

------
AboutTheWhisles
I'm very skeptical that there is any merit to the 'tracking' over other
techniques, as well as the colorization being better than this 14 year old
paper:

[http://webee.technion.ac.il/people/anat.levin/papers/coloriz...](http://webee.technion.ac.il/people/anat.levin/papers/colorization-siggraph04.pdf)

The results in their videos look very poor.

~~~
IanCal
Tracking is not the method they're using for colourizing, but the other way
around. Your linked paper has no tracking.

~~~
AboutTheWhisles
I realize that.

For some reason they are claiming they can track things with their
colorization, yet both the colorization itself and the tracking that results
from it are extremely unimpressive.

There is no reason colorization needs to happen to do the tracking anyway. The
tracking is unimpressive and now indirect.

This isn't some sort of epiphany they've discovered, they are just reinventing
video image segmentation very poorly.

Here are half a dozen examples from a 30 second google search:

[https://www.youtube.com/watch?v=juDvLrFQF0U](https://www.youtube.com/watch?v=juDvLrFQF0U)

[https://www.youtube.com/watch?v=JYgyDdLf7GQ](https://www.youtube.com/watch?v=JYgyDdLf7GQ)

[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36247.pdf)

[https://perso.liris.cnrs.fr/nicolas.bonneel/InteractiveMulti...](https://perso.liris.cnrs.fr/nicolas.bonneel/InteractiveMultilabelVideoSegmentation.pdf)

[http://files.is.tue.mpg.de/black/papers/TsaiCVPR2016.pdf](http://files.is.tue.mpg.de/black/papers/TsaiCVPR2016.pdf)

[https://graphics.ethz.ch/~perazzif/bvs/files/bvs.pdf](https://graphics.ethz.ch/~perazzif/bvs/files/bvs.pdf)

The only reason this is news is because it's Google and the researchers seem
to think they've discovered something. Techniques like this, with much better
results, have been shown at SIGGRAPH for decades.

~~~
dimatura
Sorry, but I think you're fundamentally misunderstanding the idea of the
paper. Colorization is not the point - it's an auxiliary task that lets the
algorithm discover how to do a form of tracking.

As the paper itself states, the tracking results are not the absolute state of
the art, but they are in the same ballpark, and more importantly, learned
without supervision - just watching video. This makes it easier to train on
whatever dataset you might have lying around, and more importantly, it's a
clever, simple idea that can be improved on and adapted for different tasks.

(Disclaimer: Authors are acquaintances of mine.)

~~~
AboutTheWhisles
Again, I understand what they are doing very well. They noticed that
colorization tends to track things in video.

Of course colorization tracks objects; it wouldn't work if it didn't.

This is essentially automatic video image segmentation, which itself is
heavily derived from and related to natural image matting.

Natural image matting can itself be seen as a combination of clustering and
solving (or minimizing) the error in the matting equation described by
Porter-Duff compositing algebra.
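(For reference, the matting equation referred to here comes from the Porter-Duff "over" operator: each observed pixel is C = alpha*F + (1 - alpha)*B, a per-pixel blend of foreground and background. A minimal sketch, with function and argument names of my own choosing:)

```python
import numpy as np

def composite(fg, bg, alpha):
    """Porter-Duff 'over' compositing, the identity behind the matting
    equation: C = alpha * F + (1 - alpha) * B.
    fg, bg: (H, W, 3) float images; alpha: (H, W) matte in [0, 1]."""
    a = alpha[..., None]            # broadcast the matte over color channels
    return a * fg + (1.0 - a) * bg
```

Matting is the inverse problem: given C, recover alpha (and F, B), which is what makes it a minimization.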

So, automatic video segmentation can be seen as clustering over 3 dimensions
of pixels - x, y and time, with some loose expectations of coherency over
time.

There are many ways to achieve this, which should be obvious if you watch some
of the videos or glance at some of the papers I've linked.

One simple way is with a bilateral filter to iterate over the volume of
pixels, which gradually clusters them together. One of the papers shows this
technique.
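(To make the bilateral idea concrete, here's a brute-force sketch of one such iteration - my own illustration, not taken from any of the linked papers. Each pixel's color is pulled toward a weighted mean of all pixels, weighted by closeness in both (x, y, t) and color:)

```python
import numpy as np

def bilateral_step(positions, colors, sigma_s=2.0, sigma_r=0.1):
    """One bilateral-filter-style clustering pass over a pixel volume.
    positions: (N, 3) array of (x, y, t); colors: (N, C) features.
    A brute-force O(N^2) sketch; real systems accelerate this with
    grids or splat/blur/slice schemes."""
    d_pos = positions[:, None, :] - positions[None, :, :]     # (N, N, 3)
    d_col = colors[:, None, :] - colors[None, :, :]           # (N, N, C)
    w = np.exp(-(d_pos ** 2).sum(-1) / (2 * sigma_s ** 2)
               - (d_col ** 2).sum(-1) / (2 * sigma_r ** 2))   # joint kernel
    w /= w.sum(axis=1, keepdims=True)                         # normalize rows
    return w @ colors                                         # smoothed colors
```

Iterating this collapses nearby, similarly-colored pixels toward shared values, i.e. clusters coherent across space and time.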

Everything I linked gives much better results. None of it requires 'deep
learning', and the idea that colorization follows objects is so trivial that
it's nonsense to make a paper out of it. This is more a case of visibility and
most people not knowing the research that has already been done. That's
understandable for people here, but the authors of this paper should have
known better.

