
Extracting image metadata at scale - _jomo
http://techblog.netflix.com/2016/03/extracting-image-metadata-at-scale.html
======
woodman
For their image resizing tasks, I wonder if they've tried anything more
complex than simply cropping around points of interest, something like seam
carving [0]. I imagine it would be pretty cheap to run a bunch of different
algorithms on an image and then A/B test the results on Amazon Mechanical
Turk.

[0]
[https://en.wikipedia.org/wiki/Seam_carving](https://en.wikipedia.org/wiki/Seam_carving),
[https://www.youtube.com/watch?v=6NcIJXTlugc](https://www.youtube.com/watch?v=6NcIJXTlugc),
[https://www.youtube.com/watch?v=AJtE8afwJEg](https://www.youtube.com/watch?v=AJtE8afwJEg)
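For readers unfamiliar with the technique: the core of seam carving is a dynamic program that finds the cheapest connected path of pixels through an energy map, then removes it. A minimal numpy sketch (illustrative only; the gradient-magnitude energy function and all names here are my own simplifications, not anyone's production code):

```python
import numpy as np

def energy_map(gray):
    """Gradient-magnitude energy: high values mark 'interesting' pixels."""
    gy, gx = np.gradient(gray.astype(float))
    return np.abs(gx) + np.abs(gy)

def find_vertical_seam(energy):
    """Dynamic programming: cheapest 8-connected top-to-bottom path."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for i in range(1, h):
        left = np.r_[np.inf, cost[i - 1, :-1]]
        right = np.r_[cost[i - 1, 1:], np.inf]
        cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
    # Backtrack from the cheapest bottom-row cell.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))
    return seam

def remove_vertical_seam(gray, seam):
    """Drop one pixel per row (grayscale images only, for brevity)."""
    h, w = gray.shape
    mask = np.ones((h, w), dtype=bool)
    mask[np.arange(h), seam] = False
    return gray[mask].reshape(h, w - 1)
```

Repeatedly removing seams shrinks the image while preserving high-energy regions, which is why it tends to beat naive cropping for retargeting.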

~~~
tekromancr
That looks amazing, and relatively easy to implement. However, it seems that
Mitsubishi owns a patent on it, so maybe we will start seeing it used in __
years when the patent expires.

~~~
sbarre
Seam carving has been in Photoshop, and other applications, for years now (as
is clearly mentioned in the linked Wikipedia article).

------
maciejgryka
This is very interesting, but the real question is: how do you test which
approach is better?

For example, in the text detection case there are almost unlimited
combinations of transforms that you can put together. Usually you use some
hybrid of gut feeling and results to decide, but I bet Netflix has enough data
to make that call in a more principled way.

Would be awesome to hear about that. How do you create a labeled dataset? How
exactly do you measure which approach is better? Is there a perceptual element
to it, or is it all quantitative?

Edit: here's the related money quote from the retargeting paper linked in the
other comment:

    "In terms of objective measures for retargeting, our results show that
    we are still a long way from imitating human perception. There is a
    relatively large discrepancy between such measures and the subjective
    data we collected, and in fact, the most preferred algorithm by human
    viewers, SV, received low ranking by almost all automatic distance
    measures."

------
Xyik
Don't think it actually talks much about how it does it at 'scale'. How
expensive is it to perform these operations? Are images cropped dynamically as
they are requested, or do they pre-process the images and cache them
somewhere?

Did they do anything clever to parallelize the process? What underlying
technologies do they use...

~~~
maciejgryka
From the code samples, it looks like OpenCV... which is pretty hard to beat
for well-understood image processing algorithms like thresholding, etc.

I guess at this point you can do it "at scale" by throwing enough servers and
caching at the problem :)
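For context, the kind of thresholding the article's OpenCV snippets rely on is conceptually a one-liner. A numpy sketch of what `cv2.threshold` with `THRESH_BINARY` computes (an illustrative re-implementation, not the article's code):

```python
import numpy as np

def binary_threshold(gray, thresh=127, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; everything else 0.
    Mirrors cv2.threshold(gray, thresh, maxval, cv2.THRESH_BINARY)[1]."""
    return np.where(gray > thresh, maxval, 0).astype(gray.dtype)
```

The operation is embarrassingly parallel per-pixel, which is part of why the "throw servers at it" approach works so well here.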

~~~
zitterbewegung
It would be more interesting to see what they are using to manage the servers
and run OpenCV, which is what the comment was asking.

------
zetazzed
The authors may want to consider how much of this work could be done easily
and effectively with deep learning. For content-based search and image
similarity, even simple, pre-trained convnets will likely crush the histogram-
based approaches you have here.

Just run your images through Google Cloud vision to do the face detection and
text detection. With 2M images, it will be cheaper than the amount of dev time
you spent here, and you'll get excellent quality.
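For concreteness, the histogram-based similarity being criticized here is roughly the following: compare normalized per-channel color histograms with histogram intersection. A minimal sketch (my own illustration of the general approach, not the article's code; a pre-trained convnet embedding would replace `color_histogram` with a learned feature vector):

```python
import numpy as np

def color_histogram(img, bins=8):
    """Concatenated per-channel histograms, normalized to sum to 1.
    A crude global descriptor: ignores all spatial layout."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(img.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def histogram_similarity(a, b):
    """Histogram intersection: 1.0 for identical, 0.0 for disjoint."""
    return float(np.minimum(a, b).sum())
```

The weakness is visible in the code: two images with the same colors in completely different arrangements score as identical, which is exactly the failure mode a convnet feature avoids.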

~~~
TTPrograms
They explain that not all of the images they want are faces in this case, so
you'd have to train your own model on "interesting regions" (though there is
some work in that area). Part of the challenge there is generating all the
labels for what the interesting regions are. This way they don't need to
generate labels, at least.

~~~
aab0
YouTube did something similar for 'interesting thumbnails' last year with deep
nets (many uploaders do not specify a good thumbnail preview), and reported
that it gave a nice performance boost.

