Review — Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation

Panoptic Segmentation Using Semantic Segmentation & Instance Segmentation

Sik-Ho Tsang
5 min read · Feb 17


Panoptic-DeepLab predicts three outputs: semantic segmentation, instance center prediction, and instance center regression. Class-agnostic instance segmentation, obtained by grouping predicted foreground pixels to their closest predicted instance centers, is then fused with semantic segmentation by a majority-vote rule to generate the final panoptic segmentation.

Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation,
Panoptic-DeepLab, by UIUC and Google Research,
2020 CVPR, Over 350 Citations (Sik-Ho Tsang @ Medium)
Panoptic Segmentation, Semantic Segmentation, Instance Segmentation

  • Panoptic-DeepLab is proposed, which adopts dual-ASPP and dual-decoder structures specific to semantic segmentation and instance segmentation, respectively.
  • The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression.


  1. Panoptic-DeepLab
  2. Results

1. Panoptic-DeepLab

Panoptic-DeepLab adopts dual-context and dual-decoder modules for semantic segmentation and instance segmentation predictions.

1.1. Model Architecture

  • The encoder backbone is an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps in its last block. Separate ASPP (DeepLabv2) and decoder modules are employed for semantic segmentation and instance segmentation.
  • The light-weight decoder module follows DeepLabv3+ with two modifications: (1) an additional low-level feature with output stride 8 is introduced to the decoder, thus the spatial resolution is gradually recovered by a factor of 2, and (2) in each upsampling stage, a single 5×5 depthwise-separable convolution (MobileNetV1) is applied.
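The 5×5 depthwise-separable convolution used in each upsampling stage factorizes a standard convolution into a per-channel spatial filter followed by a 1×1 channel-mixing convolution, cutting parameters and FLOPs. The following is a minimal NumPy sketch of that factorization (an illustration only, not the paper's implementation; the function name and loop-based filtering are my own):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """5x5 depthwise-separable conv sketch: per-channel kxk filter, then 1x1 mix.

    x: (C, H, W) feature map
    dw_kernels: (C, k, k) one spatial filter per input channel
    pw_weights: (C_out, C) pointwise 1x1 weights mixing channels
    """
    C, H, W = x.shape
    k = dw_kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # 'same' padding
    dw = np.zeros_like(x, dtype=float)
    for c in range(C):  # depthwise: each channel filtered independently
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = (xp[c, i:i + k, j:j + k] * dw_kernels[c]).sum()
    # pointwise 1x1 convolution: (C_out, C) x (C, H, W) -> (C_out, H, W)
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

A depthwise 5×5 plus pointwise 1×1 costs roughly C·25 + C_out·C multiplies per pixel, versus C_out·C·25 for a full 5×5 convolution.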

1.2. Semantic Segmentation Head

  • A weighted bootstrapped cross entropy loss, following DeeperLab, is used, which weights each pixel differently.
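The idea of the weighted bootstrapped cross entropy is to average the loss only over the hardest pixels (bootstrapping) while up-weighting selected pixels. A minimal NumPy sketch follows; the function name and the `top_k_frac` value are my own assumptions for illustration:

```python
import numpy as np

def weighted_bootstrapped_ce(logits, labels, pixel_weights, top_k_frac=0.15):
    """Weighted bootstrapped CE sketch: average only the hardest top-k pixels.

    logits: (N, C) per-pixel class scores (pixels flattened to N)
    labels: (N,) ground-truth class indices
    pixel_weights: (N,) per-pixel weights (e.g., 3 for small-instance pixels)
    """
    n = logits.shape[0]
    z = logits - logits.max(axis=1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(n), labels] * pixel_weights        # weighted per-pixel CE
    k = max(1, int(top_k_frac * n))
    return np.sort(ce)[-k:].mean()                          # keep hardest k only
```

Bootstrapping keeps easy, already-correct pixels from dominating the gradient, which helps small objects.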

1.3. Class-Agnostic Instance Segmentation Head

  • For every foreground pixel (i.e., a pixel whose class is a ‘thing’), the offset to its corresponding mass center is further predicted. Ground-truth instance centers are encoded by a 2D Gaussian with a standard deviation of 8 pixels.
  • Mean Squared Error (MSE) loss is used to minimize the distance between predicted heatmaps and 2D Gaussian-encoded groundtruth heatmaps.
  • L1 loss is used for the offset prediction, which is only activated at pixels belonging to object instances.
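The Gaussian center encoding and the two losses above can be sketched in a few lines of NumPy (an illustration under my own function names, not the paper's code):

```python
import numpy as np

def center_heatmap(h, w, centers, sigma=8.0):
    """Encode ground-truth instance centers as 2D Gaussians (std = 8 pixels)."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w))
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)        # keep the max where Gaussians overlap
    return hm

def heatmap_mse(pred, gt):
    """MSE between predicted and Gaussian-encoded ground-truth heatmaps."""
    return ((pred - gt) ** 2).mean()

def offset_l1(pred_off, gt_off, thing_mask):
    """L1 offset loss, active only at pixels belonging to object instances."""
    return np.abs(pred_off - gt_off)[thing_mask].mean()
```

Note the masking in `offset_l1`: background (‘stuff’) pixels contribute nothing to the offset loss.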

1.4. Panoptic Segmentation

  • A highly efficient majority voting algorithm is used to merge semantic and instance segmentation into final panoptic segmentation.
  • The instance id for each pixel is the index of the closest predicted instance center C_k after moving the pixel location (i, j) by the predicted offset O(i, j):

  k̂_{i,j} = argmin_k ‖C_k − ((i, j) + O(i, j))‖

  • where k̂_{i,j} is the predicted instance id. The semantic segmentation prediction is used to filter out ‘stuff’ pixels, whose instance id is always set to 0.
  • A fast and parallelizable method is used to merge the predicted semantic segmentation and class-agnostic instance segmentation results following the “majority vote” principle proposed in DeeperLab.
  • The semantic label of a predicted instance mask is inferred by the majority vote of the corresponding predicted semantic labels.
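Putting the grouping and the majority vote together, a minimal NumPy sketch of the merging step might look as follows (my own function name and a brute-force distance computation, for illustration only):

```python
import numpy as np

def panoptic_merge(centers, offsets, semantic, thing_mask):
    """Group thing pixels to their closest center, then majority-vote labels.

    centers: list of K predicted (y, x) instance centers
    offsets: (H, W, 2) predicted offsets (dy, dx) to instance centers
    semantic: (H, W) predicted semantic labels
    thing_mask: (H, W) bool, True where the pixel is a 'thing'
    """
    H, W = semantic.shape
    ys, xs = np.mgrid[0:H, 0:W]
    moved = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]], axis=-1)
    c = np.asarray(centers, dtype=float)                       # (K, 2)
    dist = np.linalg.norm(moved[:, :, None, :] - c[None, None], axis=-1)
    inst = np.argmin(dist, axis=-1) + 1                        # instance ids 1..K
    inst[~thing_mask] = 0                                      # stuff pixels -> 0
    labels = {}
    for k in range(1, len(centers) + 1):                       # majority vote
        m = inst == k
        if m.any():
            vals, cnt = np.unique(semantic[m], return_counts=True)
            labels[k] = int(vals[np.argmax(cnt)])
    return inst, labels
```

Every step is a dense array operation over pixels, which is why this merging is fast and parallelizable in practice.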

1.5. Instance Segmentation

  • The class-specific confidence score for each instance mask P is:

  Score(P) = Score(Objectness) × Score(Class)

  • where Score(Objectness) is the unnormalized objectness score obtained from the class-agnostic center point heatmap, and Score(Class) is obtained from the average of the semantic segmentation predictions within the predicted mask region.
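This scoring rule is a one-liner; a small NumPy sketch (hypothetical function name, for illustration):

```python
import numpy as np

def instance_confidence(center_hm, sem_probs, mask, center):
    """Score(P) = Score(Objectness) * Score(Class).

    center_hm: (H, W) class-agnostic center heatmap
    sem_probs: (H, W) semantic probability for the instance's class
    mask: (H, W) bool, the predicted instance mask
    center: (y, x) location of the instance's predicted center
    """
    objectness = center_hm[center]        # unnormalized score at the center
    class_score = sem_probs[mask].mean()  # mean semantic prob inside the mask
    return objectness * class_score
```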

1.6. Loss Function

  • Panoptic-DeepLab is trained with three loss functions: weighted bootstrapped cross entropy loss for the semantic segmentation head (Lsem), MSE loss for the center heatmap head (Lheatmap), and L1 loss for the center offset head (Loffset):

  L = λsem·Lsem + λheatmap·Lheatmap + λoffset·Loffset

  • Following DeeperLab, λsem = 3 for pixels belonging to instances with an area smaller than 64×64, and λsem = 1 everywhere else.
  • λheatmap = 200 and λoffset = 0.01.
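The per-pixel λsem weighting and the overall loss combination can be sketched as follows (my own function names; the 64×64 area threshold and the 200 / 0.01 weights are from the paper):

```python
import numpy as np

def semantic_weight_map(instance_ids, small_area=64 * 64):
    """Per-pixel weights: 3 inside instances smaller than 64x64, else 1."""
    w = np.ones(instance_ids.shape)
    for k in np.unique(instance_ids):
        if k == 0:                        # id 0 = stuff / no instance
            continue
        m = instance_ids == k
        if m.sum() < small_area:
            w[m] = 3.0                    # emphasize small instances
    return w

def total_loss(l_sem, l_heatmap, l_offset):
    # lambda_heatmap = 200, lambda_offset = 0.01
    # (the per-pixel lambda_sem weighting is folded into l_sem via w)
    return l_sem + 200.0 * l_heatmap + 0.01 * l_offset
```

The large λheatmap compensates for the small magnitude of the MSE on near-zero heatmaps, while the small λoffset balances pixel-unit offset errors.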

2. Results

2.1. Ablation Study

Ablation studies on Cityscapes val set.
  • The MSE loss brings a 0.8% PQ improvement.
  • Applying both dual-decoder and dual-ASPP gives a further 0.7% PQ improvement.
  • Employing a large crop size of 1025×2049 further improves AP and mIoU by 0.6% and 0.9%, respectively.

Finally, increasing the feature channels from 128 to 256 in the semantic segmentation branch achieves the best result of 63.0% PQ, 35.3% AP, and 80.5% mIoU.

2.2. Cityscapes

Left: Cityscapes val set. Right: Cityscapes test set.
  • Val Set (Left): When using only fine annotations, the best Panoptic-DeepLab, with multi-scale inputs and left-right flips, outperforms the best bottom-up approach, SSAP, by 3.0% PQ and 1.2% AP, and is better than the best proposal-based approach, AdaptIS, by 2.1% PQ, 2.2% AP, and 2.3% mIoU.
  • When using extra data, the best Panoptic-DeepLab outperforms UPSNet by 5.2% PQ, 3.5% AP, and 3.9% mIoU, and Seamless by 2.0% PQ and 2.4% mIoU.
  • Test Set (Right): The single unified Panoptic-DeepLab achieves state-of-the-art results, ranking first in all three Cityscapes tasks among published works.

2.3. Mapillary Vistas & COCO

Panoptic-DeepLab also obtains good results.

  • (I skip the figures here for brevity. Please feel free to read the paper directly for more details.)

2.4. Running Time

Left: PQ vs. Seconds. Right: End-to-end runtime, including merging semantic and instance segmentation.

Panoptic-DeepLab achieves the best trade-off across all three datasets.

2.5. Visualizations

Visualization of Panoptic-DeepLab with Xception-71 on Mapillary Vistas val set.

Only single scale inference is used here and the model achieves 37.7% PQ.


[2020 CVPR] [Panoptic-DeepLab]
Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation

