Brief Review — DeeperLab: Single-Shot Image Parser

DeeperLab, Image Parser, Fusing Semantic Segmentation and Keypoint Predictions

Sik-Ho Tsang
6 min read · Feb 8


DeeperLab: Single-Shot Image Parser,
DeeperLab, by MIT, Google Inc., and UC Berkeley
2019 arXiv v2, Over 150 Citations (Sik-Ho Tsang @ Medium)
Panoptic Segmentation, Instance Segmentation, Scene Parsing, Semantic Segmentation

  • DeeperLab image parser is proposed to perform whole image parsing with a significantly simpler, fully convolutional approach.
  • It jointly addresses the semantic and instance segmentation tasks in a single-shot manner. This leads to fast processing.


  1. DeeperLab Encoder & Decoder
  2. DeeperLab Image Parsing Prediction Heads
  3. Results

1. DeeperLab Encoder & Decoder

The proposed single-shot, bottom-up network adopts the encoder-decoder paradigm and produces per-pixel semantic and instance predictions.
  • For efficiency, the semantic segmentation and instance segmentation results are generated from the shared decoder output and then fused to produce the final image parsing result.

1.1. Encoder

  • Two backbone networks built on efficient depthwise separable convolutions, as in MobileNetV1, are used: the standard Xception-71 for higher accuracy, and a novel Wider variant of MobileNetV2 for faster inference.
  • The Wider MobileNetV2 replaces all 3×3 convolutions with 5×5 convolutions, which efficiently increases the receptive field to 981×981.
  • Additionally, the network backbone is augmented with the effective ASPP module (Atrous Spatial Pyramid Pooling), as in DeepLabv2 and DeepLabv3.
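The effect of swapping 3×3 kernels for 5×5 kernels on the receptive field can be sketched with a generic receptive-field calculator. The layer stack below is purely hypothetical, not the actual Wider MobileNetV2 architecture; it only illustrates why larger kernels roughly double how far the network "sees".

```python
# Sketch: how swapping 3x3 kernels for 5x5 grows the receptive field.
# The layer list below is hypothetical, NOT the actual Wider MobileNetV2 stack.

def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field size."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer extends the field by (k-1) input strides
        jump *= s             # stride compounds the effective step in the input
    return rf

# Ten conv layers, strides 2 on the first four (purely illustrative):
strides = [2, 2, 2, 2] + [1] * 6
rf_3x3 = receptive_field([(3, s) for s in strides])
rf_5x5 = receptive_field([(5, s) for s in strides])
print(rf_3x3, rf_5x5)  # the 5x5 stack sees roughly twice as far
```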

1.2. Decoder

An example of the space-to-depth (S2D) and depth-to-space (D2S) operations.
  • Following DeepLabv3, the number of channels of ASPP outputs, and the low-level feature map, are first individually reduced by 1×1 convolution and then concatenated together.
  • The decoder applies two large-kernel (7×7) depthwise convolutions to further increase the receptive field. The resultant feature map has 4096 channels, which are then reduced to 256 by the depth-to-space (D2S) operation.
  • The resultant feature maps are used as inputs of Image Parsing Prediction Heads.
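The S2D and D2S operations above are pure reshuffles of data between spatial resolution and channels. A minimal NumPy sketch (channel-last layout is an assumption for illustration):

```python
# Sketch of the space-to-depth (S2D) and depth-to-space (D2S) operations
# (pure NumPy; channel-last layout is an assumption for illustration).
import numpy as np

def space_to_depth(x, block):
    """(H, W, C) -> (H/block, W/block, block*block*C): trade resolution for channels."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // block, w // block, block * block * c)

def depth_to_space(x, block):
    """(H, W, C) -> (H*block, W*block, C/(block*block)): the exact inverse of S2D."""
    h, w, c = x.shape
    x = x.reshape(h, w, block, block, c // (block * block))
    return x.transpose(0, 2, 1, 3, 4).reshape(h * block, w * block, c // (block * block))

# D2S with block=4 turns a 4096-channel map into a 256-channel one, as in the decoder:
feat = np.zeros((32, 32, 4096), dtype=np.float32)
print(depth_to_space(feat, 4).shape)  # (128, 128, 256)
```

Note that with block size 4, D2S divides the channel count by 16, which is exactly the 4096 → 256 reduction described above.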

2. DeeperLab Image Parsing Prediction Heads

  • (These are very fine details; I try my best to shorten and roughly present the idea here, otherwise the story would be too long.)

2.1. Semantic Segmentation Head

  • The weighted bootstrapped cross-entropy loss is used.
  • The pixels are sorted by their cross-entropy loss, and the errors are only backpropagated for the top-K pixels (hard example mining), with K = 0.15×N, where N is the total number of pixels in the image:
  • ℓ = −(1/K) Σᵢ wᵢ · 1[pᵢ,yᵢ < t_K] · log pᵢ,yᵢ,
  • where yᵢ is the target class label for pixel i, pᵢ,j is the predicted posterior probability for pixel i and class j, t_K is the loss threshold that selects the top-K pixels, and 1[x] = 1 if x is true and 0 otherwise.
  • The weight wᵢ = 3 for pixels that belong to instances with an area smaller than 64×64, and wᵢ = 1 everywhere else.

By doing so, the network is trained to focus on both hard pixels and small instances.
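The loss above can be sketched in NumPy (forward pass only; the per-pixel weights and top-K selection follow the description above, while the toy inputs are made up):

```python
# Sketch of the weighted bootstrapped cross-entropy loss (NumPy, forward pass only).
import numpy as np

def bootstrapped_ce(probs, labels, small_instance_mask, k_frac=0.15):
    """probs: (N, C) softmax outputs, labels: (N,) target class ids,
    small_instance_mask: (N,) True for pixels in instances smaller than 64x64."""
    n = labels.shape[0]
    k = max(1, int(k_frac * n))
    weights = np.where(small_instance_mask, 3.0, 1.0)    # w_i = 3 for small instances
    per_pixel = -weights * np.log(probs[np.arange(n), labels] + 1e-12)
    hardest = np.sort(per_pixel)[::-1][:k]               # keep only the top-K losses
    return hardest.mean()

# Toy example: 20 pixels, 3 classes; only the 3 hardest pixels (K = 0.15 * 20)
# contribute to the loss.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3), size=20)
y = rng.integers(0, 3, size=20)
small = np.zeros(20, dtype=bool)
print(bootstrapped_ce(p, y, small))
```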

2.2. Instance Segmentation Heads

Four prediction maps generated by the instance-related heads: (a) keypoint heatmap, (b) long-range offset, (c) short-range offset, and (d) middle-range offset. The red stars denote the keypoints, the green disk denotes the target for keypoint prediction, and the blue lines/arrows denote the offsets from the current pixel to the target keypoint.
  • (It is better to read PersonLab for more details in this part.)
  • A keypoint-based representation is used: the four bounding box corners plus the center of mass, giving P=5 object keypoints.
  • Following PersonLab, four prediction heads are defined, which are used for instance segmentation: a keypoint heatmap as well as long-range, short-range, and middle-range offset maps.
  • (a) The keypoint heatmap: predicts whether a pixel is within a disk of radius R=25 pixels centered in the corresponding keypoint.
  • The target activation is equal to 1 in the interior of the disks and 0 elsewhere. The standard sigmoid cross entropy loss is used.
  • (b) The long-range offset map: predicts the position offset from a pixel to all the corresponding keypoints, encoding the long-range information for each pixel.
  • The predicted long-range offset map has 2P channels, where every two channels predict the offset in the horizontal and vertical directions for each keypoint. L1 loss is used, and is only activated at pixels belonging to object instances.
  • (c) The short-range offset map: is similar to the long-range offset map except that it only focuses on pixels within the disk of radius R=25 pixels to improve keypoint localization. L1 loss is used, and only activated at the interior of the disks.
  • (d) The middle-range offset map: predicts the offset among keypoint pairs, defined in a directed keypoint relation graph (DKRG). This map is used to group keypoints that belong to the same instance.
  • It has 2E channels, where E=8 is the number of directed edges, to predict the horizontal and vertical offsets. L1 loss is used, which is only activated at the interior of the disks.
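The four head outputs and the disk target can be summarized in a small sketch. The 2P and 2E offset channel counts follow the text; the P-channel heatmap (one disk map per keypoint type), the feature-map size, and the keypoint position are assumptions for illustration.

```python
# Sketch of the four instance-head output shapes and the keypoint disk target.
# P=5 keypoints, E=8 DKRG edges, and R=25 follow the text; H, W, and the
# keypoint position are made up for illustration.
import numpy as np

P, E, R = 5, 8, 25           # keypoints per instance, DKRG edges, disk radius
H, W = 128, 128              # hypothetical output resolution

head_channels = {
    "keypoint_heatmap": P,       # one disk map per keypoint type (assumed)
    "long_range_offset": 2 * P,  # (dy, dx) from every pixel to every keypoint
    "short_range_offset": 2 * P, # (dy, dx), only within the R-pixel disks
    "middle_range_offset": 2 * E,# (dy, dx) along each directed keypoint edge
}

def disk_target(h, w, center, radius=R):
    """Binary target map: 1 inside the disk of `radius` around `center`, else 0."""
    yy, xx = np.ogrid[:h, :w]
    return ((yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2).astype(np.float32)

t = disk_target(H, W, (64, 64))
print(head_channels, t[64, 64], t[0, 0])  # center is inside the disk, corner is not
```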

2.3. Instance Prediction

  • The instance segmentation map is generated from the four instance-related prediction maps similarly to PersonLab, with recursive offset refinement and keypoint localization.
  • The keypoints are clustered to detect instances by using a fast greedy algorithm.
  • Finally, given the detected instances, an instance label is assigned to each pixel by using the predicted long-range offset map, which encodes the pixel-wise offset to the keypoints.
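The final step can be sketched as follows: each pixel votes with its predicted long-range offset for an instance center, and is assigned to the closest detected center. This is a simplification of the paper's full procedure, with made-up offsets and centers.

```python
# Sketch of pixel-to-instance assignment via the long-range offset map
# (simplified: only the offset to the center-of-mass keypoint is used).
import numpy as np

def assign_instances(center_offsets, detected_centers):
    """center_offsets: (H, W, 2) predicted (dy, dx) from each pixel to its center.
    detected_centers: (M, 2) centers found by keypoint clustering.
    Returns an (H, W) map of instance ids in [0, M)."""
    h, w, _ = center_offsets.shape
    yy, xx = np.mgrid[:h, :w]
    voted = np.stack([yy + center_offsets[..., 0], xx + center_offsets[..., 1]], axis=-1)
    # distance from each pixel's voted center to every detected center: (H, W, M)
    d = np.linalg.norm(voted[..., None, :] - np.asarray(detected_centers), axis=-1)
    return d.argmin(axis=-1)

# Toy example: the left half of a 5x8 map points at center (2, 2), the right
# half at center (2, 6), so the pixels split into two instances.
centers = np.array([[2, 2], [2, 6]])
yy, xx = np.mgrid[:5, :8]
tgt = np.where(xx[..., None] < 4, centers[0], centers[1])
off = np.zeros((5, 8, 2))
off[..., 0] = tgt[..., 0] - yy
off[..., 1] = tgt[..., 1] - xx
ids = assign_instances(off, centers)
print(ids[0, 0], ids[0, 7])  # 0 1
```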

2.4. Semantic and Instance Prediction Fusion

  • Pixels that are predicted to have a 'stuff' class are assigned a single unique instance label.
  • For the other pixels, their instance labels are determined from the instance segmentation result while their semantic labels are resolved by the majority vote of the corresponding predicted semantic labels.
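The fusion rule above can be sketched in NumPy: 'stuff' pixels share one instance label, and each predicted instance takes the majority vote of its pixels' semantic labels. The class ids below are arbitrary placeholders.

```python
# Sketch of the semantic/instance fusion rule (class ids are placeholders).
import numpy as np

def fuse(semantic, instance, stuff_classes):
    """semantic: (H, W) class ids; instance: (H, W) instance ids (>0 for things).
    Returns (instance_label, semantic_label) maps after fusion."""
    fused_inst = instance.copy()
    fused_sem = semantic.copy()
    stuff = np.isin(semantic, list(stuff_classes))
    fused_inst[stuff] = 0                      # one shared label for all stuff pixels
    for inst_id in np.unique(instance[~stuff]):
        mask = (instance == inst_id) & ~stuff
        if mask.any():                         # majority vote over the instance mask
            fused_sem[mask] = np.bincount(semantic[mask]).argmax()
    return fused_inst, fused_sem

sem = np.array([[1, 1, 2], [2, 2, 0]])         # class 0 treated as 'stuff' (assumed)
inst = np.array([[1, 1, 1], [1, 1, 0]])
fi, fs = fuse(sem, inst, stuff_classes={0})
print(fs)  # instance 1 becomes all class 2 by majority vote (three 2s vs two 1s)
```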

3. Results

3.1. Metrics

  • The Panoptic Quality (PQ) metric, as in PS, is used:
  • PQ = Σ₍p,g₎∈TP IoU(p, g) / (|TP| + ½|FP| + ½|FN|).
  • Authors argue that PQ is suitable in applications where one cares equally about the parsing quality of instances, irrespective of their sizes.
  • The Parsing Covering (PC) metric, which weights instances by their sizes, is proposed:
  • Covᵢ = (1/Nᵢ) Σ_{R∈Sᵢ} |R| · max_{R′∈S′ᵢ} IoU(R, R′), with Nᵢ = Σ_{R∈Sᵢ} |R|, where Sᵢ and S′ᵢ are the ground-truth and predicted regions of class i; PC averages Covᵢ over the semantic classes.
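For one semantic class, PQ reduces to a short computation over the matched pairs (IoU > 0.5 counts as a true positive). The IoU values in this sketch are made up for illustration.

```python
# Sketch of the per-class Panoptic Quality (PQ) computation:
# PQ = sum of TP IoUs / (|TP| + 0.5|FP| + 0.5|FN|).

def panoptic_quality(matched_ious, num_fp, num_fn):
    """matched_ious: IoUs of prediction/ground-truth pairs with IoU > 0.5 (the TPs).
    num_fp: unmatched predictions; num_fn: missed ground-truth instances."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Two true positives, one unmatched prediction (FP), one missed instance (FN):
print(panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1))  # 1.4 / 3
```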

3.2. Performance

DeeperLab performance on the Mapillary Vistas validation set.
  • Xception-71 based model attains 31.95% PQ and 55.26% PC, while the Wider MobileNetV2 based model achieves 25.20% PQ and 49.80% PC with faster inference (6.19 vs. 3.09 fps on GPU).
DeeperLab performance on the Mapillary Vistas test set. +: input size is downsampled by 2 (721×721).
  • Test set result is also obtained, where only PQ is provided by the test server.

3.3. Visualizations

Few image parsing results on the Mapillary Vistas validation set with proposed DeeperLab based on Xception-71.
  • The model does not generate any VOID labels.
  • (There are also other ablation studies, please feel free to read the paper directly.)


[2019 arXiv v2] [DeeperLab]
DeeperLab: Single-Shot Image Parser

1.6. Instance Segmentation

2014–2019 … [DeeperLab] 2020 [Open Images] 2021 [PVT, PVTv1] [Copy-Paste] 2022 [PVTv2] [YOLACT++]

1.7. Panoptic Segmentation

2019 [PS] [UPSNet] [Semantic FPN, Panoptic FPN] [DeeperLab] 2020 [DETR]
