This is a very active field of research. Another thread worth pulling on is Mask R-CNN: https://arxiv.org/abs/1703.06870
It's not quite as simple as "this one has highest mAP, let's use it"; the tradeoffs are complex. In particular, as you can see in the image here, one thing DeepLab doesn't do is segment instances – so you get a mask of "people", not a mask per person. Mask R-CNN does a better job on that by design, because it predicts both bounding boxes and a mask per bounding box.
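To make the semantic vs. instance distinction concrete, here is a toy sketch in plain Python. The one-row "image" and the helper names `semantic_mask` and `instance_masks` are made up for illustration; neither DeepLab nor Mask R-CNN exposes anything like this.

```python
# Toy one-row "image": 0 is background, 1 and 2 are two people standing
# next to each other. These labels are hypothetical ground truth.
instance_ids = [0, 1, 1, 2, 2, 2, 0, 0]


def semantic_mask(ids):
    """Semantic view: every person pixel becomes 1, identity is lost."""
    return [1 if i > 0 else 0 for i in ids]


def instance_masks(ids):
    """Instance view: one binary mask per person id."""
    people = sorted(set(ids) - {0})
    return {p: [1 if i == p else 0 for i in ids] for p in people}


people_mask = semantic_mask(instance_ids)
per_person = instance_masks(instance_ids)

print(people_mask)  # one blob covering both people
print(per_person)   # two separate masks, keyed by person id
```

Because the two people touch, the semantic mask is a single connected blob, and no amount of post-processing on that mask alone can tell you where one person ends and the other begins.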
Overall I'm really happy to work in a domain where people share their code and models in such an open way.
I take issue with Detectron in particular though, because a company the size of Facebook has no excuse to publish a major software package in Python 2 in 2018.
The oldest models they implement are from 2015 (excluding VGG16, which is so ubiquitous it's available in Python 3 in literally every library) and Caffe2 is quite a bit more recent than that. Like I said: no excuse.
Link to Arxiv (DeepLabv3): https://arxiv.org/abs/1706.05587
Link to GitHub: https://github.com/tensorflow/models/tree/master/research/de...
The README on there has a very neat TLDR of the model:
"DeepLabv1 : We use atrous convolution ['a shorthand for convolution with upsampled filters'] to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks.
DeepLabv2 : We use atrous spatial pyramid pooling (ASPP) ['a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution'] to robustly segment objects at multiple scales with filters at multiple sampling rates and effective fields-of-views.
DeepLabv3 : We augment the ASPP module with image-level feature [5, 6] to capture longer range information. We also include batch normalization parameters to facilitate the training. In particular, we apply atrous convolution to extract output features at different output strides during training and evaluation, which efficiently enables training BN at output stride = 16 and attains a high performance at output stride = 8 during evaluation.
DeepLabv3+ : We extend DeepLabv3 to include a simple yet effective decoder module to refine the segmentation results especially along object boundaries. Furthermore, in this encoder-decoder structure one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade-off precision and runtime."
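For anyone wondering what "atrous convolution" and "convolution with upsampled filters" mean in practice, here is a minimal 1-D sketch in plain Python. It is not the DeepLab code: the function names, the toy signal and the kernel are all assumptions for illustration, and like most deep learning frameworks it computes a cross-correlation (no kernel flip).

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D cross-correlation with kernel taps `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # effective field of view
    return [
        sum(w * signal[start + j * rate] for j, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def upsample_kernel(kernel, rate):
    """Insert rate - 1 zeros between neighbouring kernel taps."""
    out = []
    for j, w in enumerate(kernel):
        if j > 0:
            out.extend([0] * (rate - 1))
        out.append(w)
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)   # sees 3 samples per output
atrous = atrous_conv1d(signal, kernel, rate=2)  # sees 5 samples, same 3 weights
same = atrous_conv1d(signal, upsample_kernel(kernel, 2), rate=1)

print(dense, atrous, same)
```

The last two results are identical, which is the "upsampled filter" reading: a larger rate widens the field of view without adding weights. ASPP, roughly speaking, runs several of these at different rates on the same feature map and combines the results.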
DeepLabv3+ is still a wildly inefficient network structure, but it undeniably works if you can afford the computational resources. Just keep in mind you can achieve similar results (within 1% mIoU) with much leaner structures.
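For reference, mIoU (mean intersection over union) is the per-class IoU averaged over classes. A rough sketch on flattened label lists, with made-up labels and a hypothetical `mean_iou` helper; real benchmarks usually accumulate a confusion matrix over the whole dataset rather than scoring a single image like this.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU, skipping classes absent from both inputs."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Hypothetical flattened per-pixel class labels for one tiny image.
target = [0, 0, 1, 1, 1, 2]
pred = [0, 1, 1, 1, 1, 2]

score = mean_iou(pred, target, num_classes=3)
print(score)
```

Here the class IoUs are 1/2, 3/4 and 1/1, so a "1% mIoU" gap means a difference of 0.01 in that average.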