Review — Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Panoptic Segmentation Using Semantic Segmentation & Instance Segmentation
Panoptic-DeepLab, by UIUC and Google Research
2020 CVPR, Over 350 Citations (Sik-Ho Tsang @ Medium)
Panoptic Segmentation, Semantic Segmentation, Instance Segmentation
- Panoptic-DeepLab is proposed, which adopts dual-ASPP and dual-decoder structures specific to semantic segmentation and instance segmentation, respectively.
- The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression.
1.1. Model Architecture
- The encoder backbone is an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps in its last block. Separate ASPP (DeepLabv2) and decoder modules are employed for semantic segmentation and instance segmentation.
- The lightweight decoder module follows DeepLabv3+ with two modifications: (1) an additional low-level feature with output stride 8 is introduced to the decoder, so the spatial resolution is gradually recovered by a factor of 2, and (2) in each upsampling stage, a single 5×5 depthwise separable convolution (as in MobileNetV1) is applied.
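As a rough illustration (not the paper's actual implementation), a 5×5 depthwise separable convolution is a per-channel 5×5 depthwise convolution followed by a 1×1 pointwise convolution that mixes channels. The function and parameter names below are made up for this sketch:

```python
import numpy as np

def depthwise_separable_5x5(x, dw_kernels, pw_weights):
    """Sketch of a 5x5 depthwise separable convolution with 'same' padding.

    x:          (H, W, Cin) input feature map
    dw_kernels: (5, 5, Cin) one 5x5 filter per input channel (depthwise step)
    pw_weights: (Cin, Cout) 1x1 pointwise weights (channel-mixing step)
    """
    h, w, cin = x.shape
    pad = np.pad(x, ((2, 2), (2, 2), (0, 0)))  # 'same' padding for a 5x5 kernel
    depthwise = np.zeros_like(x)
    for i in range(5):
        for j in range(5):
            # per-channel multiply-accumulate: no cross-channel mixing here
            depthwise += pad[i:i + h, j:j + w, :] * dw_kernels[i, j]
    # 1x1 pointwise convolution mixes channels
    return depthwise @ pw_weights
```

Compared with a full 5×5 convolution, this factorization cuts the parameter count from 25·Cin·Cout to 25·Cin + Cin·Cout, which is why it suits a lightweight decoder.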
1.2. Semantic Segmentation Head
- A weighted bootstrapped cross-entropy loss, following DeeperLab, is used, which weights each pixel differently.
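A minimal NumPy sketch of such a weighted bootstrapped cross-entropy loss is shown below; the function name and the `top_k_ratio` default are illustrative assumptions, not the paper's exact hyper-parameters:

```python
import numpy as np

def weighted_bootstrapped_ce(logits, labels, weights, top_k_ratio=0.15):
    """Weighted bootstrapped cross-entropy (sketch, in the spirit of DeeperLab).

    logits:  (H, W, C) unnormalized class scores
    labels:  (H, W) integer ground-truth labels
    weights: (H, W) per-pixel weights (e.g. larger for small instances)
    Only the top_k_ratio hardest pixels contribute to the loss.
    """
    h, w, c = logits.shape
    # numerically stable softmax over classes
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # per-pixel weighted cross entropy
    p_true = probs.reshape(-1, c)[np.arange(h * w), labels.reshape(-1)]
    ce = -np.log(np.clip(p_true, 1e-12, None)) * weights.reshape(-1)
    # bootstrap: average only over the K hardest pixels
    k = max(1, int(top_k_ratio * h * w))
    return np.sort(ce)[-k:].mean()
```

The bootstrapping keeps gradients focused on hard pixels, while the per-pixel weights (see 1.6 below) emphasize small instances.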
1.3. Class-Agnostic Instance Segmentation Head
- For every foreground pixel (i.e., pixel whose class is a ‘thing’), the offset to its corresponding mass center is further predicted. Ground-truth instance centers are encoded by a 2D Gaussian with standard deviation of 8 pixels.
- A Mean Squared Error (MSE) loss is used to minimize the distance between the predicted heatmaps and the 2D Gaussian-encoded ground-truth heatmaps.
- L1 loss is used for the offset prediction, which is only activated at pixels belonging to object instances.
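The center encoding and the two losses above can be sketched in NumPy as follows (function names are assumptions; taking the per-pixel maximum where Gaussians overlap is a common choice, not stated explicitly in this review):

```python
import numpy as np

def encode_centers(centers, h, w, sigma=8.0):
    """Render ground-truth instance centers as a 2D Gaussian heatmap
    (standard deviation 8 pixels, as in the paper)."""
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # max where Gaussians overlap (assumed)
    return heatmap

def center_losses(pred_heat, gt_heat, pred_off, gt_off, instance_mask):
    """MSE on center heatmaps; L1 on offsets, active only at 'thing' pixels."""
    mse = np.mean((pred_heat - gt_heat) ** 2)
    m = instance_mask.astype(bool)
    l1 = np.abs(pred_off - gt_off)[m].mean() if m.any() else 0.0
    return mse, l1
```

Note that the L1 offset loss is masked: background ("stuff") pixels contribute nothing, matching the bullet above.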
1.4. Panoptic Segmentation
- A highly efficient majority voting algorithm is used to merge semantic and instance segmentation into final panoptic segmentation.
- The instance id for a pixel is the index of the closest predicted instance center after moving the pixel location (i, j) by the predicted offset O(i, j):

k̂_{i,j} = argmin_k ‖C_k − ((i, j) + O(i, j))‖

- where k̂_{i,j} is the predicted instance id and C_k is the k-th predicted instance center. The semantic segmentation prediction is used to filter out ‘stuff’ pixels, whose instance id is always set to 0.
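This pixel-to-center grouping can be sketched as follows (a NumPy illustration with assumed names; ids are offset by 1 here so that 0 can denote ‘stuff’):

```python
import numpy as np

def group_instances(centers, offsets, thing_mask):
    """Assign each 'thing' pixel to its nearest instance center after shifting
    the pixel by its predicted offset (sketch of the grouping step).

    centers:    (K, 2) predicted center coordinates (y, x)
    offsets:    (H, W, 2) predicted offsets O(i, j)
    thing_mask: (H, W) bool, True where the semantic prediction is a 'thing'
    Returns an (H, W) instance-id map; 'stuff' pixels get id 0, instances 1..K.
    """
    h, w, _ = offsets.shape
    ids = np.zeros((h, w), dtype=np.int32)
    if len(centers) == 0:
        return ids
    coords = np.stack(np.mgrid[0:h, 0:w], axis=-1)  # (H, W, 2) pixel locations
    shifted = coords + offsets                      # move each pixel toward its center
    # distance from every shifted pixel to every center: (H, W, K)
    d = np.linalg.norm(shifted[..., None, :] - np.asarray(centers)[None, None], axis=-1)
    ids[thing_mask] = 1 + np.argmin(d, axis=-1)[thing_mask]
    return ids
```

Because the assignment is a single argmin over centers, it is trivially parallelizable on GPU, which is part of why the merging step is fast.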
- A fast and parallelizable method is used to merge the predicted semantic segmentation and class-agnostic instance segmentation results following the “majority vote” principle proposed in DeeperLab.
- The semantic label of a predicted instance mask is inferred by the majority vote of the corresponding predicted semantic labels.
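A sketch of this majority-vote merge is below; the `semantic_label * label_divisor + instance_id` panoptic encoding is an assumed convention (common in panoptic tooling), not something this review specifies:

```python
import numpy as np

def merge_panoptic(semantic, instance_ids, label_divisor=1000):
    """Merge semantic and class-agnostic instance predictions by majority vote.

    semantic:     (H, W) predicted semantic labels
    instance_ids: (H, W) class-agnostic instance ids (0 = 'stuff')
    Each instance mask takes the most frequent semantic label inside it.
    """
    # 'stuff' pixels keep instance id 0
    panoptic = semantic.astype(np.int64) * label_divisor
    for inst in np.unique(instance_ids):
        if inst == 0:
            continue
        mask = instance_ids == inst
        # majority vote over the semantic labels covered by this instance mask
        majority = int(np.bincount(semantic[mask]).argmax())
        panoptic[mask] = majority * label_divisor + int(inst)
    return panoptic
```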
1.5. Instance Segmentation
- The class-specific confidence score for each instance mask is:

Score = Score(Objectness) × Score(Class)

- where Score(Objectness) is the unnormalized objectness score obtained from the class-agnostic center point heatmap, and Score(Class) is obtained from the average of the semantic segmentation predictions within the predicted mask region.
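For one instance, this score can be computed as in the sketch below (names and argument layout are assumptions for illustration):

```python
import numpy as np

def instance_confidence(center_heatmap, sem_probs, mask, center_yx, class_id):
    """Class-specific confidence for a single instance mask (sketch):
    Score = Score(Objectness) * Score(Class).

    center_heatmap: (H, W) class-agnostic center heatmap
    sem_probs:      (H, W, C) semantic segmentation probabilities
    mask:           (H, W) bool, the predicted instance mask
    center_yx:      (y, x) location of this instance's center peak
    class_id:       semantic class assigned to the instance by majority vote
    """
    objectness = center_heatmap[center_yx]               # unnormalized peak value
    class_score = sem_probs[..., class_id][mask].mean()  # average inside the mask
    return objectness * class_score
```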
1.6. Loss Function
- Panoptic-DeepLab is trained with three loss functions: a weighted bootstrapped cross-entropy loss for the semantic segmentation head (Lsem), an MSE loss for the center heatmap head (Lheatmap), and an L1 loss for the center offset head (Loffset):

L = λsem·Lsem + λheatmap·Lheatmap + λoffset·Loffset

- λsem is set to 3 for pixels belonging to instances with an area smaller than 64×64 and to 1 everywhere else, following DeeperLab.
- λheatmap = 200 and λoffset = 0.01.
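Putting the weights together, the total loss is a plain weighted sum (the per-pixel λsem up-weighting of small instances is assumed to be folded into the semantic loss term here):

```python
def panoptic_deeplab_loss(l_sem, l_heatmap, l_offset,
                          lam_sem=1.0, lam_heatmap=200.0, lam_offset=0.01):
    """Total training loss: weighted sum of the three head losses.
    lam_heatmap=200 and lam_offset=0.01 follow the paper; the per-pixel
    small-instance weighting (3 vs 1) is applied inside l_sem itself."""
    return lam_sem * l_sem + lam_heatmap * l_heatmap + lam_offset * l_offset
```

The large λheatmap compensates for the MSE loss being tiny in magnitude (heatmap values lie in [0, 1]), while the small λoffset balances the pixel-unit scale of the L1 offset loss.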
2.1. Ablation Study
- MSE loss brings 0.8% PQ improvement.
- Applying both dual-decoder and dual-ASPP gives a 0.7% PQ improvement.
- Employing a larger crop size of 1025×2049 further improves AP and mIoU by 0.6% and 0.9%, respectively.
- Finally, increasing the feature channels from 128 to 256 in the semantic segmentation branch achieves the best result of 63.0% PQ, 35.3% AP, and 80.5% mIoU.
2.2. SOTA Comparison on Cityscapes
- Val Set (Left): When using only fine annotations, the best Panoptic-DeepLab, with multi-scale inputs and left-right flips, outperforms the best bottom-up approach, SSAP, by 3.0% PQ and 1.2% AP, and is better than the best proposal-based approach, AdaptIS, by 2.1% PQ, 2.2% AP, and 2.3% mIoU.
- When using extra data, the best Panoptic-DeepLab outperforms UPSNet by 5.2% PQ, 3.5% AP, and 3.9% mIoU, and Seamless by 2.0% PQ and 2.4% mIoU.
- Test Set (Right): The single unified Panoptic-DeepLab achieves state-of-the-art results, ranking first on all three Cityscapes tasks when compared with published works.
2.3. Mapillary Vistas & COCO
Panoptic-DeepLab also obtains good results.
- (I skip the figures here for brevity. Please feel free to read the paper directly for more details.)
2.4. Running Time
Panoptic-DeepLab achieves the best speed/accuracy trade-off across all three datasets.
Only single-scale inference is used here, and the model achieves 37.7% PQ.