
Semantic Image Segmentation with DeepLab in Tensorflow - EvgeniyZh
https://research.googleblog.com/2018/03/semantic-image-segmentation-with.html
======
genericpseudo
If you're interested in this but have no background, the best place to start
is "Fully Convolutional Networks for Semantic Segmentation" –
[https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn...](https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf)

This is a very active field of research. Another thread worth pulling on is
Mask R-CNN:
[https://arxiv.org/abs/1703.06870](https://arxiv.org/abs/1703.06870)

It's not quite as simple as "this one has highest mAP, let's use it"; the
tradeoffs are complex. In particular, as you can see in the image here, one
thing DeepLab doesn't do is segment _instances_ – so you get a mask of
"people", not a mask per person. Mask R-CNN does a better job on that by
design, because it predicts both bounding boxes and a mask per bounding box.
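
To make the distinction concrete, here is a toy sketch in plain Python. The scene, the `instances` layout and the `paste` helper are all made up for illustration; this is not the actual output format of either model. It shows two adjacent people who are indistinguishable in a semantic map but stay separate as (box, mask) pairs:

```python
# Toy 4x6 scene: 0 = background, 1 = person. Two people stand side by side.
PERSON = 1
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Instance-style output (hypothetical layout): one box and one mask per object.
# box = (row0, col0, row1, col1) with exclusive ends; mask is relative to the box.
instances = [
    {"label": PERSON, "box": (1, 1, 3, 3), "mask": [[1, 1], [1, 1]]},
    {"label": PERSON, "box": (1, 3, 3, 5), "mask": [[1, 1], [1, 1]]},
]

def paste(instances, height, width):
    """Collapse per-instance masks into one semantic map (identity is lost)."""
    out = [[0] * width for _ in range(height)]
    for inst in instances:
        r0, c0, r1, c1 = inst["box"]
        for r in range(r0, r1):
            for c in range(c0, c1):
                if inst["mask"][r - r0][c - c0]:
                    out[r][c] = inst["label"]
    return out

# Going from instances to a semantic map is easy; the reverse is not,
# because the semantic map no longer says where one person ends.
print(paste(instances, 4, 6) == semantic)
print(len(instances))
```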

~~~
alexcnwy
Great summary. I believe both models are available in Detectron if anyone
wants to give them a go:

[https://github.com/facebookresearch/Detectron](https://github.com/facebookresearch/Detectron)

~~~
black_puppydog
Yes for Mask R-CNN. For FCN, there is R-FCN.

Overall I'm really happy to work in a domain where people share their code and
models in such an open way. I take issue with Detectron in particular though,
because a company the size of Facebook has no excuse, in 2018, to publish a
major software package in Python 2. The oldest models they implement are from
2015 (excluding VGG16, which is so ubiquitous it's available in literally
every library for Python 3) and Caffe2 is quite a bit more recent than that.
Like I said: no excuse.

~~~
genericpseudo
The team behind Detectron have published an enormous amount of really good
research, but the Detectron codebase struck me as "good research code" rather
than something you'd ideally want in production.

~~~
black_puppydog
Of course, I'm not criticising the fact that they publish those models, nor
the models themselves. But publishing even arguably polished Python 2 code in
2018 is something I take issue with if it's not a legacy code base.

------
andreykurenkov
Link to arXiv (DeepLabv2):
[https://arxiv.org/abs/1606.00915](https://arxiv.org/abs/1606.00915)

Link to arXiv (DeepLabv3):
[https://arxiv.org/abs/1706.05587](https://arxiv.org/abs/1706.05587)

Link to GitHub:
[https://github.com/tensorflow/models/tree/master/research/de...](https://github.com/tensorflow/models/tree/master/research/deeplab)

The README on there has a very neat TLDR of the model:

"DeepLabv1 [1]: We use atrous convolution ['s a shorthand for convolution with
upsampled filter'] to explicitly control the resolution at which feature
responses are computed within Deep Convolutional Neural Networks.

DeepLabv2 [2]: We use atrous spatial pyramid pooling (ASPP) ['a
computationally efficient scheme of resampling a given feature layer at
multiple rates prior to convolution'] to robustly segment objects at multiple
scales with filters at multiple sampling rates and effective fields-of-views.

DeepLabv3 [3]: We augment the ASPP module with image-level feature [5, 6] to
capture longer range information. We also include batch normalization [7]
parameters to facilitate the training. In particular, we apply atrous
convolution to extract output features at different output strides during
training and evaluation, which efficiently enables training BN at output
stride = 16 and attains a high performance at output stride = 8 during
evaluation.

DeepLabv3+ [4]: We extend DeepLabv3 to include a simple yet effective decoder
module to refine the segmentation results especially along object boundaries.
Furthermore, in this encoder-decoder structure one can arbitrarily control the
resolution of extracted encoder features by atrous convolution to trade-off
precision and runtime."
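
The "convolution with upsampled filter" description can be checked on a toy 1-D case. This is only a sketch in plain Python with made-up helper names (`conv1d_valid`, `atrous_conv1d`, `upsample_filter`), not code from the DeepLab repository. It compares sampling the input with gaps between kernel taps against an ordinary convolution with zeros inserted into the kernel:

```python
def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation (what deep-learning libraries call convolution)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def atrous_conv1d(signal, kernel, rate):
    """Apply the kernel with a gap of `rate` input samples between taps."""
    span = (len(kernel) - 1) * rate + 1  # effective field of view
    return [
        sum(signal[i + j * rate] * kernel[j] for j in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]

def upsample_filter(kernel, rate):
    """Insert rate - 1 zeros between consecutive kernel taps."""
    out = []
    for j, w in enumerate(kernel):
        out.append(w)
        if j < len(kernel) - 1:
            out.extend([0] * (rate - 1))
    return out

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]

dilated = atrous_conv1d(signal, kernel, rate=2)
via_upsampled = conv1d_valid(signal, upsample_filter(kernel, 2))
print(dilated == via_upsampled)
```

The two results match, but the atrous version never multiplies by the inserted zeros. The field of view grows with the rate while the number of weights stays at three, which is the property the multi-rate ASPP branches rely on.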

------
Aeolos
Congratulations, DeepLabv3+ finally discovered that the U-Net architecture,
first proposed 3 years ago, is more efficient than the flat architecture they
used before.

DeepLabv3+ is still a wildly inefficient network structure, but it undeniably
works, if you can afford the computational resources. Just keep in mind you
can achieve similar results (within 1% mIoU) with much leaner structures.
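
For anyone unfamiliar with the metric, here is a minimal sketch of mean intersection-over-union on flattened label lists. The `mean_iou` helper and the toy labels are illustrative assumptions; benchmark implementations usually accumulate a confusion matrix over the whole dataset rather than scoring one image like this:

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Flattened per-pixel class ids for a tiny six-pixel "image".
target = [0, 0, 1, 1, 1, 2]
pred = [0, 1, 1, 1, 2, 2]

print(mean_iou(pred, target, num_classes=3))
```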

------
jack_pp
Is this fast enough to be used for background removal in live streams?

~~~
genericpseudo
Not at the kind of resolution you'd want to be using on, e.g., Twitch. In that
setting, you could just use chromakey, though? That's '70s technology, cheap
and very reliable.
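
Part of why it is cheap: the core of a chroma key is a per-pixel colour test. A rough sketch in plain Python follows, where the `chroma_key` helper, the key colour and the tolerance are all assumptions picked for illustration; real keyers tend to work in a different colour space and produce soft edges:

```python
def chroma_key(frame, background, key=(0, 255, 0), tolerance=100):
    """Replace pixels close to the key colour with the matching background pixel."""
    def is_key(pixel):
        # Squared Euclidean distance in RGB space against the key colour.
        return sum((a - b) ** 2 for a, b in zip(pixel, key)) <= tolerance ** 2

    return [
        [bg if is_key(px) else px for px, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

# One-row "frame": pure green, a skin-like tone, and slightly off green.
frame = [[(0, 255, 0), (200, 150, 120), (10, 240, 20)]]
background = [[(0, 0, 255), (0, 0, 255), (0, 0, 255)]]

print(chroma_key(frame, background))
```

There is no model and no per-frame inference, so the cost is a handful of arithmetic operations per pixel.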

~~~
jack_pp
You could, but it's cumbersome; amateur streamers might not wish to invest in
the setup.

