Review: Semantic FPN, Panoptic FPN — Panoptic Feature Pyramid Networks

Mask R-CNN + FPN: Semantic+Instance+Panoptic Segmentation

Apr 17, 2022

**Panoptic FPN results on COCO (top) and Cityscapes (bottom) using a single ResNet-101-FPN network.**

Panoptic Feature Pyramid Networks
Semantic FPN, Panoptic FPN, by Facebook AI Research (FAIR)
2019 CVPR, Over 400 Citations (Sik-Ho Tsang @ Medium)
Panoptic Segmentation, Semantic Segmentation, Instance Segmentation

The panoptic segmentation task unifies instance segmentation (for thing classes) and semantic segmentation (for stuff classes).
Panoptic Feature Pyramid Networks (Panoptic FPN) aims to unify these methods at the architectural level by designing a single network for both tasks.
Panoptic FPN endows Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch that shares the Feature Pyramid Network (FPN) backbone.
Semantic FPN won first place, without ensembling, on the COCO Stuff leaderboard.

Outline

1. Panoptic Feature Pyramid Network (Panoptic FPN)
2. Other Details
3. Experimental Results

1. Panoptic Feature Pyramid Network (Panoptic FPN)

**(a) FPN, (b) Mask R-CNN, (c) Panoptic FPN: a simple extension of Mask R-CNN with FPN**

1.1. (a) FPN

Feature Pyramid Network (FPN) takes a standard network (ResNet) with features at multiple spatial resolutions, and adds a light top-down pathway with lateral connections.
The top-down pathway starts from the deepest layer of the network and progressively upsamples it while adding in transformed versions of higher-resolution features from the bottom-up pathway.
FPN generates a pyramid, typically with scales from 1/32 to 1/4 resolution, where each pyramid level has the same channel dimension (256 by default).
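
To make the top-down pathway concrete, here is a toy 1-D sketch in plain Python. It is only an illustration under simplifying assumptions: each feature map is a single row of numbers, the lateral 1×1 convolution is replaced by a hypothetical scalar `lateral_weight`, and upsampling is nearest-neighbour by a factor of 2. The function names are made up for this example.

```python
def upsample2x_nearest(row):
    """Repeat every element twice (nearest-neighbour 2x upsampling in 1-D)."""
    return [value for value in row for _ in range(2)]


def top_down_pathway(bottom_up, lateral_weight=1.0):
    """Build a toy feature pyramid.

    bottom_up: list of 1-D feature rows ordered from finest to coarsest,
               where each row is half as long as the previous one.
    lateral_weight: stand-in (assumption) for the lateral 1x1 convolution.
    Returns the pyramid in the same fine-to-coarse order.
    """
    # Start from the deepest (coarsest) level.
    current = [lateral_weight * v for v in bottom_up[-1]]
    pyramid = [current]
    # Walk towards finer levels: upsample, then add the lateral features.
    for row in reversed(bottom_up[:-1]):
        upsampled = upsample2x_nearest(current)
        lateral = [lateral_weight * v for v in row]
        current = [u + l for u, l in zip(upsampled, lateral)]
        pyramid.append(current)
    pyramid.reverse()
    return pyramid
```

Each finer level therefore carries its own high-resolution features plus the upsampled context from every deeper level.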

1.2. (b) Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding an FCN branch to predict a binary segmentation mask for each candidate region.

1.3. (c) Panoptic FPN

As mentioned, the aim is to modify Mask R-CNN with FPN to enable pixel-wise semantic segmentation prediction. A simple design is to merge the information from all levels of the FPN pyramid into a single output.

Starting from the deepest FPN level (at 1/32 scale), 3 upsampling stages are performed to yield a feature map at 1/4 scale, where each upsampling stage consists of a 3×3 convolution, Group Norm, ReLU, and 2× bilinear upsampling. This strategy is repeated for FPN scales 1/16, 1/8, and 1/4, with progressively fewer upsampling stages.
The result is a set of feature maps at the same 1/4 scale, which are then element-wise summed. A final 1×1 convolution, 4× bilinear upsampling, and softmax are used to generate the per-pixel class labels at the original image resolution. In addition to stuff classes, this branch also outputs a special ‘other’ class for all pixels belonging to objects (to avoid predicting stuff classes for such pixels).
A standard FPN configuration with 256 output channels per scale is used, and the proposed semantic segmentation branch reduces this to 128 channels.
For the (pre-FPN) backbone, ImageNet-pretrained ResNet/ResNeXt with BN is used.
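
The shape bookkeeping of the semantic branch can be sketched as below. This is a rough plan in plain Python, not the authors' code: the function name, the dictionary keys, and the example class count are assumptions, and the number of upsampling stages per level is inferred as the number of 2× steps needed to reach 1/4 scale (so zero for the 1/4 level).

```python
FPN_STRIDES = (32, 16, 8, 4)  # FPN levels at 1/32, 1/16, 1/8 and 1/4 scale
TARGET_STRIDE = 4             # all levels are brought to 1/4 scale


def semantic_branch_plan(height, width, num_stuff_classes, branch_channels=128):
    """Return the per-level upsampling plan and the final logits shape."""
    plan = []
    for stride in FPN_STRIDES:
        # Number of 2x upsampling stages needed to go from `stride` to 1/4.
        stages = (stride // TARGET_STRIDE).bit_length() - 1
        plan.append({
            "input_stride": stride,
            "upsampling_stages": stages,
            # After its stages, every level has the same shape and can be summed.
            "output_shape": (branch_channels,
                             height // TARGET_STRIDE,
                             width // TARGET_STRIDE),
        })
    # 1x1 conv to class scores, then 4x bilinear upsampling to full resolution.
    # One extra channel is reserved for the special 'other' class.
    logits_shape = (num_stuff_classes + 1, height, width)
    return plan, logits_shape
```

For a 512×1024 input, every level ends up as a 128-channel map of size 128×256 before the element-wise sum.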

2. Other Details

2.1. Non-Maximum Suppression (NMS)

A simple NMS-like post-processing step, proposed in PS (the Panoptic Segmentation paper), is performed to resolve all overlaps by:

resolving overlaps between different instances based on their confidence scores;
resolving overlaps between instance and semantic segmentation outputs in favor of instances; and
removing any stuff regions labeled ‘other’ or under a given area threshold.
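
The three rules above can be sketched in plain Python as follows. This is a simplified illustration, not the reference implementation: masks are represented as sets of `(y, x)` pixels, the function name and the `min_stuff_area` default are hypothetical, and any further filtering used in practice is omitted.

```python
def merge_panoptic(instances, semantic, other_label="other", min_stuff_area=2):
    """Merge instance and semantic predictions into one panoptic map.

    instances: list of dicts with 'score', 'category' and 'mask' (set of (y, x)).
    semantic:  dict mapping (y, x) -> stuff label (or `other_label`).
    Returns a dict mapping (y, x) -> (kind, category, instance_index).
    """
    panoptic = {}

    # 1) Overlaps between instances: higher-confidence instances claim pixels first.
    ranked = sorted(instances, key=lambda inst: inst["score"], reverse=True)
    for index, inst in enumerate(ranked):
        for pixel in inst["mask"]:
            if pixel not in panoptic:
                panoptic[pixel] = ("thing", inst["category"], index)

    # 2) Overlaps between instances and stuff: instances win, so stuff only
    #    keeps pixels that no instance has claimed.
    stuff_regions = {}
    for pixel, label in semantic.items():
        if pixel not in panoptic:
            stuff_regions.setdefault(label, set()).add(pixel)

    # 3) Drop stuff regions labeled 'other' or smaller than the area threshold.
    for label, pixels in stuff_regions.items():
        if label == other_label or len(pixels) < min_stuff_area:
            continue
        for pixel in pixels:
            panoptic[pixel] = ("stuff", label, None)

    return panoptic
```

Pixels that end up in no segment are simply absent from the returned dictionary.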

2.2. Joint Training

During training the instance segmentation branch has three losses: Lc (classification loss), Lb (bounding-box loss), and Lm (mask loss), where Lc and Lb are normalized by the number of sampled RoIs and Lm is normalized by the number of foreground RoIs.
For semantic segmentation, the semantic segmentation loss, Ls, is computed as a per-pixel cross entropy loss between the predicted and the ground-truth labels, normalized by the number of labeled image pixels.
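
As a minimal sketch of this normalization, assume predictions are given as per-pixel probability dictionaries and unlabeled pixels are marked with a hypothetical `ignore_label`. The function name and data layout are illustrative only.

```python
import math


def semantic_loss(probabilities, labels, ignore_label=None):
    """Per-pixel cross entropy, averaged over labeled pixels only.

    probabilities: list of dicts, one per pixel, mapping class -> predicted probability.
    labels:        list of ground-truth classes, or `ignore_label` for unlabeled pixels.
    """
    total = 0.0
    labeled = 0
    for probs, label in zip(probabilities, labels):
        if label == ignore_label:
            continue  # unlabeled pixels contribute neither loss nor count
        total += -math.log(probs[label])
        labeled += 1
    return total / labeled if labeled else 0.0
```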

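A simple loss re-weighting is used between the total instance segmentation loss and the semantic segmentation loss:

L = λi (Lc + Lb + Lm) + λs Ls

By tuning λi and λs, it is possible to train a single model that is comparable to two separate task-specific models at about half the compute.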
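
A minimal sketch of this re-weighting, assuming the total is the weighted sum of the summed instance losses and the semantic loss. The function name and the default weights are placeholders, not values from the paper.

```python
def panoptic_loss(l_c, l_b, l_m, l_s, lambda_i=1.0, lambda_s=0.5):
    """Combine instance and semantic losses with two scalar weights.

    l_c, l_b, l_m: classification, bounding-box and mask losses (instance branch).
    l_s:           semantic segmentation loss.
    lambda_i, lambda_s: task weights (placeholder defaults, to be tuned).
    """
    instance_loss = l_c + l_b + l_m
    return lambda_i * instance_loss + lambda_s * l_s
```

Raising `lambda_s` relative to `lambda_i` pushes the shared backbone towards the semantic task, and vice versa.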

2.3. Architecture Analysis

**Backbone architectures for increasing feature resolution.**

(a): A standard convolutional network.
(b): A common approach is to reduce the stride of select convolutions and use dilated convolutions after to compensate.
(c): A U-Net style network uses a symmetric decoder that mirrors the bottom-up pathway, but in reverse.
(d): FPN can be seen as an asymmetric, lightweight decoder whose top-down pathway has only one block per stage and uses a shared channel dimension.

FPN is much lighter than a typically used dilation-8 network, ~2× more efficient than the symmetric encoder-decoder, and roughly equivalent to a dilation-16 network (while producing a 4× higher resolution output).

3. Experimental Results

3.1. Semantic Segmentation

Semantic FPN is lighter than typical dilation models, while yielding higher resolution features.

The proposed Semantic FPN is comparable to state-of-the-art methods in accuracy and efficiency.

The proposed entry won first place without ensembling, outperforming competing methods by at least a 2-point margin on all reported metrics.

**Left: Ablation on channel width, Right: Ablation on Feature map aggregation method (mIoU)**

Left: ResNet-50 Semantic FPN with varying number of channels in the semantic branch. It is found that 128 strikes a good balance between accuracy and efficiency.
Right: Summation versus concatenation for aggregating the feature maps. While accuracy for both is comparable, summation is more efficient.

3.2. Panoptic Segmentation

(For details of PQ, RQ and SQ, please feel free to read PS.)
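
As a quick reminder of how the three metrics relate, here is a small sketch for a single class, assuming the standard definition from PS: predicted and ground-truth segments are matched when their IoU exceeds 0.5, SQ is the mean IoU of the matches, RQ is an F1-style score, and PQ = SQ × RQ. The function name and argument layout are made up for this example.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each matched (true positive) segment pair.
    num_false_positives: predicted segments without a match.
    num_false_negatives: ground-truth segments without a match.
    """
    true_positives = len(matched_ious)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    # Segmentation quality: how well the matched segments overlap.
    sq = sum(matched_ious) / true_positives
    # Recognition quality: F1-style detection score.
    rq = true_positives / (
        true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    )
    return sq * rq, sq, rq
```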

**Left: Panoptic Segmentation: Panoptic R50-FPN vs. R50-FPN×2, Right: Panoptic Segmentation: Panoptic R101-FPN vs. R50-FPN×2**

Left: Using a single FPN network for solving both tasks simultaneously yields comparable accuracy to two independent FPN networks for instance and semantic segmentation, but with roughly half the compute.
Right: Given a roughly equal computational budget, a single FPN network for the panoptic task outperforms two independent FPN networks for instance and semantic segmentation by a healthy margin.

**Left: Training Panoptic FPN, Right: Grouped FPN**

Left: During training, for each minibatch we can either combine the semantic and instance losses or alternate which loss we compute. Combining the losses in each minibatch performs much better.
Right: The 256 FPN channels are split into two groups of 128, and the instance branch and the semantic branch are each applied to their own dedicated group.
While this gives mixed gains, better multi-task strategies are expected to improve results.