Review — Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Panoptic Segmentation Using Semantic Segmentation & Instance Segmentation
Panoptic-DeepLab, by UIUC and Google Research
2020 CVPR, Over 350 Citations (Sik-Ho Tsang @ Medium)
Panoptic Segmentation, Semantic Segmentation, Instance Segmentation
- Panoptic-DeepLab is proposed, which adopts dual-ASPP and dual-decoder structures specific to semantic segmentation and instance segmentation, respectively.
- The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression.
1.1. Model Architecture
- The encoder backbone is an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps in its last block. Separate ASPP (DeepLabv2) and decoder modules are employed for semantic segmentation and instance segmentation.
- The lightweight decoder module follows DeepLabv3+ with two modifications: (1) an additional low-level feature with output stride 8 is introduced to the decoder, so the spatial resolution is gradually recovered by a factor of 2, and (2) in each upsampling stage, a single 5×5 depthwise separable convolution (as in MobileNetV1) is applied.
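As a rough illustration (not the paper's actual implementation), a 5×5 depthwise separable convolution is a per-channel 5×5 depthwise convolution followed by a 1×1 pointwise convolution that mixes channels. The function and parameter names below are made up for this sketch:

```python
import numpy as np

def depthwise_separable_5x5(x, dw_kernels, pw_weights):
    """Sketch of a 5x5 depthwise separable convolution with 'same' padding.

    x:          (H, W, Cin) input feature map
    dw_kernels: (5, 5, Cin) one 5x5 filter per input channel (depthwise step)
    pw_weights: (Cin, Cout) 1x1 pointwise weights (channel-mixing step)
    """
    h, w, cin = x.shape
    pad = np.pad(x, ((2, 2), (2, 2), (0, 0)))  # 'same' padding for a 5x5 kernel
    depthwise = np.zeros_like(x)
    for i in range(5):
        for j in range(5):
            # per-channel multiply-accumulate: no cross-channel mixing here
            depthwise += pad[i:i + h, j:j + w, :] * dw_kernels[i, j]
    # 1x1 pointwise convolution mixes channels
    return depthwise @ pw_weights
```

Compared with a full 5×5 convolution, this factorization cuts the parameter count from 25·Cin·Cout to 25·Cin + Cin·Cout, which is why it suits a lightweight decoder.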
1.2. Semantic Segmentation Head
- A weighted bootstrapped cross-entropy loss, following DeeperLab, is used, which weights each pixel differently.
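A minimal NumPy sketch of such a weighted bootstrapped cross-entropy loss is shown below; the function name and the `top_k_ratio` default are illustrative assumptions, not the paper's exact hyper-parameters:

```python
import numpy as np

def weighted_bootstrapped_ce(logits, labels, weights, top_k_ratio=0.15):
    """Weighted bootstrapped cross-entropy (sketch, in the spirit of DeeperLab).

    logits:  (H, W, C) unnormalized class scores
    labels:  (H, W) integer ground-truth labels
    weights: (H, W) per-pixel weights (e.g. larger for small instances)
    Only the top_k_ratio hardest pixels contribute to the loss.
    """
    h, w, c = logits.shape
    # numerically stable softmax over classes
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # per-pixel weighted cross entropy
    p_true = probs.reshape(-1, c)[np.arange(h * w), labels.reshape(-1)]
    ce = -np.log(np.clip(p_true, 1e-12, None)) * weights.reshape(-1)
    # bootstrap: average only over the K hardest pixels
    k = max(1, int(top_k_ratio * h * w))
    return np.sort(ce)[-k:].mean()
```

The bootstrapping keeps gradients focused on hard pixels, while the per-pixel weights (see 1.6 below) emphasize small instances.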
1.3. Class-Agnostic Instance Segmentation Head
- For every foreground pixel (i.e., pixel whose class is a ‘thing’), the offset to its corresponding mass center is further predicted. Ground-truth instance centers are encoded by a 2D Gaussian with standard deviation of 8 pixels.
- A Mean Squared Error (MSE) loss is used to minimize the distance between the predicted heatmaps and the 2D Gaussian-encoded ground-truth heatmaps.
- L1 loss is used for the offset prediction, which is only activated at pixels belonging to object instances.
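The center encoding and the two losses above can be sketched in NumPy as follows (function names are assumptions; taking the per-pixel maximum where Gaussians overlap is a common choice, not stated explicitly in this review):

```python
import numpy as np

def encode_centers(centers, h, w, sigma=8.0):
    """Render ground-truth instance centers as a 2D Gaussian heatmap
    (standard deviation 8 pixels, as in the paper)."""
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # max where Gaussians overlap (assumed)
    return heatmap

def center_losses(pred_heat, gt_heat, pred_off, gt_off, instance_mask):
    """MSE on center heatmaps; L1 on offsets, active only at 'thing' pixels."""
    mse = np.mean((pred_heat - gt_heat) ** 2)
    m = instance_mask.astype(bool)
    l1 = np.abs(pred_off - gt_off)[m].mean() if m.any() else 0.0
    return mse, l1
```

Note that the L1 offset loss is masked: background ("stuff") pixels contribute nothing, matching the bullet above.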
1.4. Panoptic Segmentation
- A highly efficient majority voting algorithm is used to merge semantic and instance segmentation into final panoptic segmentation.
- The instance id for a pixel is the index of the closest predicted instance center after moving the pixel location (i, j) by the predicted offset O(i, j):

k̂_{i,j} = argmin_k ‖C_k − ((i, j) + O(i, j))‖

- where k̂_{i,j} is the predicted instance id and C_k is the k-th predicted instance center. The semantic segmentation prediction is used to filter out ‘stuff’ pixels, whose instance id is always set to 0.
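This pixel-to-center grouping can be sketched as follows (a NumPy illustration with assumed names; ids are offset by 1 here so that 0 can denote ‘stuff’):

```python
import numpy as np

def group_instances(centers, offsets, thing_mask):
    """Assign each 'thing' pixel to its nearest instance center after shifting
    the pixel by its predicted offset (sketch of the grouping step).

    centers:    (K, 2) predicted center coordinates (y, x)
    offsets:    (H, W, 2) predicted offsets O(i, j)
    thing_mask: (H, W) bool, True where the semantic prediction is a 'thing'
    Returns an (H, W) instance-id map; 'stuff' pixels get id 0, instances 1..K.
    """
    h, w, _ = offsets.shape
    ids = np.zeros((h, w), dtype=np.int32)
    if len(centers) == 0:
        return ids
    coords = np.stack(np.mgrid[0:h, 0:w], axis=-1)  # (H, W, 2) pixel locations
    shifted = coords + offsets                      # move each pixel toward its center
    # distance from every shifted pixel to every center: (H, W, K)
    d = np.linalg.norm(shifted[..., None, :] - np.asarray(centers)[None, None], axis=-1)
    ids[thing_mask] = 1 + np.argmin(d, axis=-1)[thing_mask]
    return ids
```

Because the assignment is a single argmin over centers, it is trivially parallelizable on GPU, which is part of why the merging step is fast.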
- A fast and parallelizable method is used to merge the predicted semantic segmentation and class-agnostic instance segmentation results following the “majority vote” principle proposed in DeeperLab.
- The semantic label of a predicted instance mask is inferred by the majority vote of the corresponding predicted semantic labels.
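A sketch of this majority-vote merge is below; the `semantic_label * label_divisor + instance_id` panoptic encoding is an assumed convention (common in panoptic tooling), not something this review specifies:

```python
import numpy as np

def merge_panoptic(semantic, instance_ids, label_divisor=1000):
    """Merge semantic and class-agnostic instance predictions by majority vote.

    semantic:     (H, W) predicted semantic labels
    instance_ids: (H, W) class-agnostic instance ids (0 = 'stuff')
    Each instance mask takes the most frequent semantic label inside it.
    """
    # 'stuff' pixels keep instance id 0
    panoptic = semantic.astype(np.int64) * label_divisor
    for inst in np.unique(instance_ids):
        if inst == 0:
            continue
        mask = instance_ids == inst
        # majority vote over the semantic labels covered by this instance mask
        majority = int(np.bincount(semantic[mask]).argmax())
        panoptic[mask] = majority * label_divisor + int(inst)
    return panoptic
```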
1.5. Instance Segmentation
- The class-specific confidence score for each instance mask is:

Score = Score(Objectness) × Score(Class)

- where Score(Objectness) is the unnormalized objectness score obtained from the class-agnostic center point heatmap, and Score(Class) is obtained from the average of the semantic segmentation predictions within the predicted mask region.
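For one instance, this score can be computed as in the sketch below (names and argument layout are assumptions for illustration):

```python
import numpy as np

def instance_confidence(center_heatmap, sem_probs, mask, center_yx, class_id):
    """Class-specific confidence for a single instance mask (sketch):
    Score = Score(Objectness) * Score(Class).

    center_heatmap: (H, W) class-agnostic center heatmap
    sem_probs:      (H, W, C) semantic segmentation probabilities
    mask:           (H, W) bool, the predicted instance mask
    center_yx:      (y, x) location of this instance's center peak
    class_id:       semantic class assigned to the instance by majority vote
    """
    objectness = center_heatmap[center_yx]               # unnormalized peak value
    class_score = sem_probs[..., class_id][mask].mean()  # average inside the mask
    return objectness * class_score
```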
1.6. Loss Function
- Panoptic-DeepLab is trained with three loss functions: a weighted bootstrapped cross-entropy loss for the semantic segmentation head (Lsem), an MSE loss for the center heatmap head (Lheatmap), and an L1 loss for the center offset head (Loffset):

L = λsem·Lsem + λheatmap·Lheatmap + λoffset·Loffset

- λsem is set to 3 for pixels belonging to instances with an area smaller than 64×64 and to 1 everywhere else, following DeeperLab.
- λheatmap = 200 and λoffset = 0.01.
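Putting the weights together, the total loss is a plain weighted sum (the per-pixel λsem up-weighting of small instances is assumed to be folded into the semantic loss term here):

```python
def panoptic_deeplab_loss(l_sem, l_heatmap, l_offset,
                          lam_sem=1.0, lam_heatmap=200.0, lam_offset=0.01):
    """Total training loss: weighted sum of the three head losses.
    lam_heatmap=200 and lam_offset=0.01 follow the paper; the per-pixel
    small-instance weighting (3 vs 1) is applied inside l_sem itself."""
    return lam_sem * l_sem + lam_heatmap * l_heatmap + lam_offset * l_offset
```

The large λheatmap compensates for the MSE loss being tiny in magnitude (heatmap values lie in [0, 1]), while the small λoffset balances the pixel-unit scale of the L1 offset loss.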
2.1. Ablation Study
- MSE loss brings 0.8% PQ improvement.
- Applying both dual-decoder and dual-ASPP gives a 0.7% PQ improvement.
- Employing a larger crop size of 1025×2049 further improves AP and mIoU by 0.6% and 0.9%, respectively.
- Finally, increasing the feature channels from 128 to 256 in the semantic segmentation branch achieves the best result of 63.0% PQ, 35.3% AP, and 80.5% mIoU.
2.2. SOTA Comparison on Cityscapes
- Val Set (Left): When using only fine annotations, the best Panoptic-DeepLab, with multi-scale inputs and left-right flips, outperforms the best bottom-up approach, SSAP, by 3.0% PQ and 1.2% AP, and is better than the best proposal-based approach, AdaptIS, by 2.1% PQ, 2.2% AP, and 2.3% mIoU.
- When using extra data, the best Panoptic-DeepLab outperforms UPSNet by 5.2% PQ, 3.5% AP, and 3.9% mIoU, and Seamless by 2.0% PQ and 2.4% mIoU.
- Test Set (Right): The single unified Panoptic-DeepLab achieves state-of-the-art results, ranking first on all three Cityscapes tasks when compared with published works.
2.3. Mapillary Vistas & COCO
Panoptic-DeepLab also obtains good results.
- (I skip the figures here for brevity. Please feel free to read the paper directly for more details.)
2.4. Running Time
Panoptic-DeepLab achieves the best speed/accuracy trade-off across all three datasets.
Only single-scale inference is used here, and the model achieves 37.7% PQ.